Document Processing

Upload documents and let Papr's intelligent analysis automatically extract memories and build a knowledge graph.

Overview

The document processing endpoint allows you to upload PDFs and Word documents. The system analyzes the content, decides what information is worth remembering, and creates structured memories with entities and relationships automatically extracted.

How It Works

  1. Upload - Send your PDF or Word document via the API
  2. Intelligent Analysis - System analyzes content and identifies important information
  3. Selective Memory Creation - Creates memories from significant content with hierarchical structure
  4. Entity Extraction - Identifies entities (people, companies, concepts, etc.)
  5. Relationship Mapping - Connects related entities in the knowledge graph
  6. Schema Guidance - Custom schemas (if provided) guide what to extract

Supported Formats

  • PDF documents (.pdf)
  • Microsoft Word documents (.docx, .doc)
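
A quick client-side check before uploading can avoid failed requests for unsupported formats. The helper below is a minimal sketch, not part of the Papr SDK; the extension set mirrors the formats listed above.

from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc"}

def is_supported_document(path: str) -> bool:
    # Compare the file extension against the supported formats above
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

if not is_supported_document("contract.pdf"):
    raise ValueError("Unsupported document format")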

Processing Providers

Papr uses multiple AI providers for document processing with automatic fallback:

  • TensorLake.ai - High-quality structured extraction
  • Reducto AI - Advanced document understanding
  • Gemini Vision - Fallback provider for reliability

The system automatically selects the best provider and falls back if needed, ensuring reliable processing.

Basic Usage

Python

from papr_memory import Papr
import os

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

# Upload a document
response = client.document.upload(
    file=open("contract.pdf", "rb"),
    hierarchical_enabled=True,  # Create hierarchical memory structure
    simple_schema_mode=True  # Use system schema + one custom schema
)

# Get the upload ID
upload_id = response.document_status.upload_id
print(f"Document uploaded: {upload_id}")

# Check processing status
status = client.document.get_status(upload_id)
print(f"Processing: {status.progress * 100}%")
print(f"Current page: {status.current_page}/{status.total_pages}")

TypeScript

import Papr from '@papr/memory';
import fs from 'fs';

const client = new Papr({
  xAPIKey: process.env.PAPR_MEMORY_API_KEY
});

// Upload a document
const response = await client.document.upload({
  file: fs.createReadStream('contract.pdf'),
  hierarchical_enabled: true,
  simple_schema_mode: true
});

// Get the upload ID
const uploadId = response.document_status.upload_id;
console.log(`Document uploaded: ${uploadId}`);

// Check processing status
const status = await client.document.getStatus(uploadId);
console.log(`Processing: ${status.progress * 100}%`);
console.log(`Current page: ${status.current_page}/${status.total_pages}`);

cURL

curl -X POST https://memory.papr.ai/v1/document \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "X-Client-Type: curl" \
  -F "file=@contract.pdf" \
  -F "hierarchical_enabled=true" \
  -F "simple_schema_mode=true"

# Check status
curl -X GET https://memory.papr.ai/v1/document/status/{upload_id} \
  -H "X-API-Key: YOUR_API_KEY"

Using Custom Schemas

Define a custom schema for your domain to guide what entities and relationships the system extracts.

# First, create a custom schema (see Custom Schemas guide for details)
schema = client.schemas.create(
    name="Legal Contract Schema",
    description="Schema for legal contract analysis",
    node_types={
        "Contract": {
            "name": "Contract",
            "label": "Contract",
            "properties": {
                "title": {"type": "string", "required": True},
                "value": {"type": "float", "required": False},
                "status": {
                    "type": "string",
                    "enum_values": ["draft", "active", "expired"],
                    "default": "draft"
                }
            },
            "required_properties": ["title"],
            "unique_identifiers": ["title"]
        },
        "Party": {
            "name": "Party",
            "label": "Party",
            "properties": {
                "name": {"type": "string", "required": True},
                "role": {"type": "string", "required": False}
            },
            "required_properties": ["name"],
            "unique_identifiers": ["name"]
        }
    },
    relationship_types={
        "PARTY_TO": {
            "name": "PARTY_TO",
            "allowed_source_types": ["Party"],
            "allowed_target_types": ["Contract"]
        }
    }
)

schema_id = schema.data.id

# Upload document with custom schema
response = client.document.upload(
    file=open("legal_contract.pdf", "rb"),
    schema_id=schema_id,  # Use your custom schema
    simple_schema_mode=True,
    hierarchical_enabled=True
)

Property Overrides

Use property overrides to ensure consistent entity IDs across your knowledge graph.

response = client.document.upload(
    file=open("contract.pdf", "rb"),
    schema_id="legal_contract_schema",
    simple_schema_mode=True,
    property_overrides=[
        {
            "nodeLabel": "Contract",
            "match": {"title": "Service Agreement 2024"},
            "set": {
                "id": "contract_sa_2024",
                "status": "active",
                "department": "legal"
            }
        },
        {
            "nodeLabel": "Party",
            "match": {"name": "Acme Corp"},
            "set": {
                "id": "party_acme",
                "type": "vendor"
            }
        }
    ]
)

This ensures that even if "Acme Corp" is mentioned multiple times or slightly differently across documents, it's always mapped to the same entity in your graph.

Webhook Notifications

For long-running document processing, you can provide a webhook URL to be notified when processing completes.

response = client.document.upload(
    file=open("large_document.pdf", "rb"),
    hierarchical_enabled=True,
    webhook_url="https://your-app.com/webhooks/document-complete",
    webhook_secret="your_webhook_secret"  # Optional: for authentication
)

When processing completes, Papr will POST to your webhook with:

{
  "upload_id": "upload_abc123",
  "status": "completed",
  "memory_items": [
    {"memoryId": "mem_123", "objectId": "obj_456", "createdAt": "2024-03-21T10:00:00Z"}
  ],
  "total_pages": 10,
  "total_memories": 15
}

If you provided a webhook_secret, it will be included in the X-Webhook-Secret header.
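
A minimal receiver for this callback might look like the sketch below. It uses Flask purely as an example framework (an assumption, not a Papr requirement) and verifies the X-Webhook-Secret header against the secret you supplied at upload time.

import os
from flask import Flask, request, jsonify

app = Flask(__name__)
WEBHOOK_SECRET = os.environ.get("PAPR_WEBHOOK_SECRET")  # the value passed as webhook_secret

@app.route("/webhooks/document-complete", methods=["POST"])
def document_complete():
    # Reject requests that don't carry the secret we registered
    if WEBHOOK_SECRET and request.headers.get("X-Webhook-Secret") != WEBHOOK_SECRET:
        return jsonify({"error": "invalid secret"}), 401

    payload = request.get_json()
    if payload.get("status") == "completed":
        # React to the new memories, e.g. index them or notify the user
        print(f"{payload['upload_id']}: {payload.get('total_memories')} memories created")
    return jsonify({"received": True}), 200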

Processing Status

Document processing happens asynchronously. Check the status to see progress:

status = client.document.get_status(upload_id)

print(f"Status: {status.status_type}")  # processing, completed, failed
print(f"Progress: {status.progress * 100}%")
print(f"Page: {status.current_page}/{status.total_pages}")
print(f"Filename: {status.current_filename}")

if status.status_type == "completed":
    print("Document processed successfully!")
elif status.status_type == "failed":
    print(f"Processing failed: {status.error}")

Status types:

  • queued - Document is queued for processing
  • processing - Currently processing
  • completed - Processing finished successfully
  • failed - Processing failed (check error field)
  • cancelled - Processing was cancelled

Cancelling Processing

Cancel document processing if needed:

response = client.document.cancel_processing(upload_id)
print(f"Cancelled: {response}")

Multi-Tenant Document Scoping

For multi-tenant applications, scope documents to specific users or organizations:

response = client.document.upload(
    file=open("user_document.pdf", "rb"),
    user_id="user_abc123",  # Associate with specific user
    hierarchical_enabled=True
)

# Or use external user IDs
response = client.document.upload(
    file=open("user_document.pdf", "rb"),
    end_user_id="external_user_456",  # Your application's user ID
    hierarchical_enabled=True
)

# For organization-level documents
response = client.document.upload(
    file=open("org_document.pdf", "rb"),
    namespace="org_123",  # Organization or namespace
    hierarchical_enabled=True
)

Real-Time Status Updates

For real-time updates during processing, use WebSocket connections (consult API documentation for WebSocket endpoint details).

Best Practices

1. Use Simple Schema Mode in Production

# Recommended: system schema + one custom schema
response = client.document.upload(
    file=open("document.pdf", "rb"),
    simple_schema_mode=True,  # More consistent results
    schema_id="your_schema_id"  # Optional: specify which custom schema
)

This provides better consistency between document processing and direct memory creation.

2. Enable Hierarchical Structure

response = client.document.upload(
    file=open("document.pdf", "rb"),
    hierarchical_enabled=True  # Create parent-child memory structure
)

Hierarchical structure preserves document organization and makes it easier to navigate memories.

3. Use Property Overrides for Key Entities

For important entities that appear across multiple documents, use property overrides to ensure consistent IDs:

property_overrides=[
    {
        "nodeLabel": "Company",
        "match": {"name": "Acme Corp"},
        "set": {"id": "company_acme", "verified": True}
    }
]

4. Poll Status for Completion

For documents without webhooks, poll the status endpoint:

import time

upload_id = response.document_status.upload_id

while True:
    status = client.document.get_status(upload_id)
    
    if status.status_type in ["completed", "failed", "cancelled"]:
        break
    
    print(f"Progress: {status.progress * 100}%")
    time.sleep(5)  # Poll every 5 seconds

if status.status_type == "completed":
    print("Processing complete!")

5. Handle Large Documents

For large documents (>100 pages), consider:

  • Using webhook notifications instead of polling
  • Processing during off-peak hours
  • Breaking into smaller chunks if possible (see the sketch below)
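
If you do split a large PDF, a library such as pypdf can do it client-side before upload. This is a sketch under the assumption that page-range splits make sense for your documents; pypdf is not part of the Papr SDK.

from pypdf import PdfReader, PdfWriter

def split_pdf(path: str, pages_per_chunk: int = 50) -> list[str]:
    # Write the document out as a series of smaller PDFs
    reader = PdfReader(path)
    chunk_paths = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for i in range(start, min(start + pages_per_chunk, len(reader.pages))):
            writer.add_page(reader.pages[i])
        chunk_path = f"{path}.part{start // pages_per_chunk}.pdf"
        with open(chunk_path, "wb") as f:
            writer.write(f)
        chunk_paths.append(chunk_path)
    return chunk_paths

# Upload each chunk separately
for chunk in split_pdf("large_document.pdf"):
    client.document.upload(file=open(chunk, "rb"), hierarchical_enabled=True)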

Common Use Cases

Legal Contract Analysis

# Upload legal contracts with legal schema
response = client.document.upload(
    file=open("service_agreement.pdf", "rb"),
    schema_id="legal_contract_schema",
    simple_schema_mode=True,
    property_overrides=[
        {
            "nodeLabel": "Contract",
            "set": {"department": "legal", "status": "active"}
        }
    ]
)

Research Paper Ingestion

# Upload research papers with academic schema
response = client.document.upload(
    file=open("research_paper.pdf", "rb"),
    schema_id="academic_research_schema",
    hierarchical_enabled=True,
    simple_schema_mode=True
)

Product Specification Processing

# Upload product specs with product schema
response = client.document.upload(
    file=open("product_spec.pdf", "rb"),
    schema_id="product_schema",
    simple_schema_mode=True,
    property_overrides=[
        {
            "nodeLabel": "Product",
            "match": {"name": "Widget Pro"},
            "set": {"id": "product_widget_pro", "status": "active"}
        }
    ]
)

Troubleshooting

Processing Takes Too Long

  • Check document size and page count
  • Verify provider availability (system automatically falls back)
  • Consider webhook notifications for large documents

Extraction Not Finding Expected Entities

  • Ensure custom schema is properly defined
  • Check property descriptions are clear and LLM-friendly
  • Verify required properties are marked correctly
  • Try manual graph generation for critical entities

Duplicate Entities Created

  • Use unique_identifiers in your schema
  • Add property overrides for key entities
  • Consider using enums for controlled vocabularies

Next Steps