Document Processing
Upload documents and let Papr's intelligent analysis automatically extract memories and build a knowledge graph.
Overview
The document processing endpoint allows you to upload PDFs and Word documents. The system analyzes the content, decides what information is worth remembering, and creates structured memories with entities and relationships automatically extracted.
How It Works
- Upload - Send your PDF or Word document via the API
- Intelligent Analysis - System analyzes content and identifies important information
- Selective Memory Creation - Creates memories from significant content with hierarchical structure
- Entity Extraction - Identifies entities (people, companies, concepts, etc.)
- Relationship Mapping - Connects related entities in the knowledge graph
- Schema Guidance - Custom schemas (if provided) guide what to extract
Supported Formats
- PDF documents (.pdf)
- Microsoft Word documents (.docx, .doc)
Processing Providers
Papr uses multiple AI providers for document processing with automatic fallback:
- TensorLake.ai - High-quality structured extraction
- Reducto AI - Advanced document understanding
- Gemini Vision - Fallback provider for reliability
The system automatically selects the best provider and falls back if needed, ensuring reliable processing.
Basic Usage
Python
from papr_memory import Papr
import os
client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))
# Upload a document
response = client.document.upload(
    file=open("contract.pdf", "rb"),
    hierarchical_enabled=True,  # Create hierarchical memory structure
    simple_schema_mode=True     # Use system schema + one custom schema
)
# Get the upload ID
upload_id = response.document_status.upload_id
print(f"Document uploaded: {upload_id}")
# Check processing status
status = client.document.get_status(upload_id)
print(f"Processing: {status.progress * 100}%")
print(f"Current page: {status.current_page}/{status.total_pages}")TypeScript
import Papr from '@papr/memory';
import fs from 'fs';
const client = new Papr({
  xAPIKey: process.env.PAPR_MEMORY_API_KEY
});
// Upload a document
const response = await client.document.upload({
  file: fs.createReadStream('contract.pdf'),
  hierarchical_enabled: true,
  simple_schema_mode: true
});
// Get the upload ID
const uploadId = response.document_status.upload_id;
console.log(`Document uploaded: ${uploadId}`);
// Check processing status
const status = await client.document.getStatus(uploadId);
console.log(`Processing: ${status.progress * 100}%`);
console.log(`Current page: ${status.current_page}/${status.total_pages}`);
cURL
curl -X POST https://memory.papr.ai/v1/document \
-H "X-API-Key: YOUR_API_KEY" \
-H "X-Client-Type: curl" \
-F "file=@contract.pdf" \
-F "hierarchical_enabled=true" \
-F "simple_schema_mode=true"
# Check status
curl -X GET https://memory.papr.ai/v1/document/status/{upload_id} \
-H "X-API-Key: YOUR_API_KEY"Using Custom Schemas
Define a custom schema for your domain to guide what entities and relationships the system extracts.
# First, create a custom schema (see Custom Schemas guide for details)
schema = client.schemas.create(
    name="Legal Contract Schema",
    description="Schema for legal contract analysis",
    node_types={
        "Contract": {
            "name": "Contract",
            "label": "Contract",
            "properties": {
                "title": {"type": "string", "required": True},
                "value": {"type": "float", "required": False},
                "status": {
                    "type": "string",
                    "enum_values": ["draft", "active", "expired"],
                    "default": "draft"
                }
            },
            "required_properties": ["title"],
            "unique_identifiers": ["title"]
        },
        "Party": {
            "name": "Party",
            "label": "Party",
            "properties": {
                "name": {"type": "string", "required": True},
                "role": {"type": "string", "required": False}
            },
            "required_properties": ["name"],
            "unique_identifiers": ["name"]
        }
    },
    relationship_types={
        "PARTY_TO": {
            "name": "PARTY_TO",
            "allowed_source_types": ["Party"],
            "allowed_target_types": ["Contract"]
        }
    }
)
schema_id = schema.data.id
# Upload document with custom schema
response = client.document.upload(
    file=open("legal_contract.pdf", "rb"),
    schema_id=schema_id,  # Use your custom schema
    simple_schema_mode=True,
    hierarchical_enabled=True
)
Property Overrides
Use property overrides to ensure consistent entity IDs across your knowledge graph.
response = client.document.upload(
    file=open("contract.pdf", "rb"),
    schema_id="legal_contract_schema",
    simple_schema_mode=True,
    property_overrides=[
        {
            "nodeLabel": "Contract",
            "match": {"title": "Service Agreement 2024"},
            "set": {
                "id": "contract_sa_2024",
                "status": "active",
                "department": "legal"
            }
        },
        {
            "nodeLabel": "Party",
            "match": {"name": "Acme Corp"},
            "set": {
                "id": "party_acme",
                "type": "vendor"
            }
        }
    ]
)
This ensures that even if "Acme Corp" is mentioned multiple times or slightly differently across documents, it is always mapped to the same entity in your graph.
Webhook Notifications
For long-running document processing, you can provide a webhook URL to be notified when processing completes.
response = client.document.upload(
    file=open("large_document.pdf", "rb"),
    hierarchical_enabled=True,
    webhook_url="https://your-app.com/webhooks/document-complete",
    webhook_secret="your_webhook_secret"  # Optional: for authentication
)
When processing completes, Papr will POST to your webhook with:
{
  "upload_id": "upload_abc123",
  "status": "completed",
  "memory_items": [
    {"memoryId": "mem_123", "objectId": "obj_456", "createdAt": "2024-03-21T10:00:00Z"}
  ],
  "total_pages": 10,
  "total_memories": 15
}
If you provided a webhook_secret, it will be included in the X-Webhook-Secret header.
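As an illustration, a minimal receiver can compare the X-Webhook-Secret header against the secret you supplied at upload time before trusting the payload. The sketch below uses Flask purely as an example framework; the route path and environment variable name are assumptions, not part of the Papr API.
import os
from flask import Flask, request, abort

app = Flask(__name__)
# Assumed env var; use whatever secret you passed as webhook_secret at upload time
WEBHOOK_SECRET = os.environ.get("PAPR_WEBHOOK_SECRET", "your_webhook_secret")

@app.route("/webhooks/document-complete", methods=["POST"])
def document_complete():
    # Reject requests that don't carry the secret registered at upload time
    if request.headers.get("X-Webhook-Secret") != WEBHOOK_SECRET:
        abort(401)

    payload = request.get_json()
    if payload.get("status") == "completed":
        memory_ids = [item["memoryId"] for item in payload.get("memory_items", [])]
        print(f"Upload {payload['upload_id']} created {len(memory_ids)} memories")
    else:
        print(f"Upload {payload['upload_id']} finished with status {payload.get('status')}")
    return "", 204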
Processing Status
Document processing happens asynchronously. Check the status to see progress:
status = client.document.get_status(upload_id)
print(f"Status: {status.status_type}")  # processing, completed, failed
print(f"Progress: {status.progress * 100}%")
print(f"Page: {status.current_page}/{status.total_pages}")
print(f"Filename: {status.current_filename}")
if status.status_type == "completed":
    print("Document processed successfully!")
elif status.status_type == "failed":
    print(f"Processing failed: {status.error}")
Status types:
- queued - Document is queued for processing
- processing - Currently processing
- completed - Processing finished successfully
- failed - Processing failed (check the error field)
- cancelled - Processing was cancelled
Cancelling Processing
Cancel document processing if needed:
response = client.document.cancelProcessing(upload_id)
print(f"Cancelled: {response}")
Multi-Tenant Document Scoping
For multi-tenant applications, scope documents to specific users or organizations:
response = client.document.upload(
    file=open("user_document.pdf", "rb"),
    user_id="user_abc123",  # Associate with specific user
    hierarchical_enabled=True
)
# Or use external user IDs
response = client.document.upload(
    file=open("user_document.pdf", "rb"),
    end_user_id="external_user_456",  # Your application's user ID
    hierarchical_enabled=True
)
# For organization-level documents
response = client.document.upload(
    file=open("org_document.pdf", "rb"),
    namespace="org_123",  # Organization or namespace
    hierarchical_enabled=True
)
Real-Time Status Updates
For real-time updates during processing, use WebSocket connections (consult API documentation for WebSocket endpoint details).
Best Practices
1. Use Simple Schema Mode in Production
# Recommended: system schema + one custom schema
response = client.document.upload(
    file=open("document.pdf", "rb"),
    simple_schema_mode=True,    # More consistent results
    schema_id="your_schema_id"  # Optional: specify which custom schema
)
This provides better consistency between document processing and direct memory creation.
2. Enable Hierarchical Structure
response = client.document.upload(
    file=open("document.pdf", "rb"),
    hierarchical_enabled=True  # Create parent-child memory structure
)
Hierarchical structure preserves document organization and makes it easier to navigate memories.
3. Use Property Overrides for Key Entities
For important entities that appear across multiple documents, use property overrides to ensure consistent IDs:
property_overrides=[
    {
        "nodeLabel": "Company",
        "match": {"name": "Acme Corp"},
        "set": {"id": "company_acme", "verified": True}
    }
]
4. Poll Status for Completion
For documents without webhooks, poll the status endpoint:
import time
upload_id = response.document_status.upload_id
while True:
    status = client.document.get_status(upload_id)
    if status.status_type in ["completed", "failed", "cancelled"]:
        break
    print(f"Progress: {status.progress * 100}%")
    time.sleep(5)  # Poll every 5 seconds
if status.status_type == "completed":
    print("Processing complete!")
5. Handle Large Documents
For large documents (>100 pages), consider:
- Using webhook notifications instead of polling
- Processing during off-peak hours
- Breaking into smaller chunks if possible (a sketch follows below)
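If a single file is too large to process comfortably, one option is to split it into smaller PDFs before uploading. The sketch below is a minimal illustration: it uses the pypdf library (an assumption, not part of the Papr SDK) to cut a document into 50-page chunks, an arbitrary size you should tune for your own documents.
from pypdf import PdfReader, PdfWriter
from papr_memory import Papr
import os

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))
CHUNK_PAGES = 50  # Illustrative chunk size

reader = PdfReader("large_document.pdf")
upload_ids = []
for start in range(0, len(reader.pages), CHUNK_PAGES):
    # Write the next CHUNK_PAGES pages to a standalone PDF
    writer = PdfWriter()
    for i in range(start, min(start + CHUNK_PAGES, len(reader.pages))):
        writer.add_page(reader.pages[i])
    chunk_path = f"large_document_part_{start // CHUNK_PAGES + 1}.pdf"
    with open(chunk_path, "wb") as f:
        writer.write(f)
    # Upload each chunk as its own document
    response = client.document.upload(
        file=open(chunk_path, "rb"),
        hierarchical_enabled=True
    )
    upload_ids.append(response.document_status.upload_id)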
Common Use Cases
Legal Contract Analysis
# Upload legal contracts with legal schema
response = client.document.upload(
    file=open("service_agreement.pdf", "rb"),
    schema_id="legal_contract_schema",
    simple_schema_mode=True,
    property_overrides=[
        {
            "nodeLabel": "Contract",
            "set": {"department": "legal", "status": "active"}
        }
    ]
)
Research Paper Ingestion
# Upload research papers with academic schema
response = client.document.upload(
    file=open("research_paper.pdf", "rb"),
    schema_id="academic_research_schema",
    hierarchical_enabled=True,
    simple_schema_mode=True
)
Product Specification Processing
# Upload product specs with product schema
response = client.document.upload(
    file=open("product_spec.pdf", "rb"),
    schema_id="product_schema",
    simple_schema_mode=True,
    property_overrides=[
        {
            "nodeLabel": "Product",
            "match": {"name": "Widget Pro"},
            "set": {"id": "product_widget_pro", "status": "active"}
        }
    ]
)
Troubleshooting
Processing Takes Too Long
- Check document size and page count
- Verify provider availability (system automatically falls back)
- Consider webhook notifications for large documents
Extraction Not Finding Expected Entities
- Ensure custom schema is properly defined
- Check property descriptions are clear and LLM-friendly
- Verify required properties are marked correctly
- Try manual graph generation for critical entities
Duplicate Entities Created
- Use unique_identifiers in your schema
- Add property overrides for key entities
- Consider using enums for controlled vocabularies (see the sketch below)
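As an illustration, the sketch below combines these three levers for a hypothetical Company node type: unique_identifiers for deduplication, an enum for a controlled vocabulary, and a property override that pins the entity ID at upload time. Names such as "Company Dedup Schema" and "company_acme" are placeholders, and relationship_types are omitted for brevity.
# Dedupe-friendly node type: unique name, controlled "tier" vocabulary
schema = client.schemas.create(
    name="Company Dedup Schema",
    description="Illustrative schema for consistent company entities",
    node_types={
        "Company": {
            "name": "Company",
            "label": "Company",
            "properties": {
                "name": {"type": "string", "required": True},
                "tier": {
                    "type": "string",
                    "enum_values": ["prospect", "customer", "partner"],
                    "default": "prospect"
                }
            },
            "required_properties": ["name"],
            "unique_identifiers": ["name"]  # Same name resolves to the same node
        }
    }
)

# Pin the entity ID for a key company so every document maps to it
response = client.document.upload(
    file=open("quarterly_report.pdf", "rb"),
    schema_id=schema.data.id,
    simple_schema_mode=True,
    property_overrides=[
        {
            "nodeLabel": "Company",
            "match": {"name": "Acme Corp"},
            "set": {"id": "company_acme"}
        }
    ]
)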
Next Steps
- Custom Schemas - Define domain ontologies
- Graph Generation - Control knowledge graph creation
- GraphQL Analysis - Query document insights
- API Reference - Complete endpoint documentation