Content Ingestion
This guide explains how to ingest different types of content into Papr Memory, including text, documents, and code snippets.
Overview
Papr Memory supports various content types to build a comprehensive memory system:
- Text-based memories (notes, conversations, json, etc.)
- Documents (PDF, HTML, TXT)
- Code snippets (with language detection)
Memory Types
Papr Memory supports the following memory types:
- text: Plain text content like notes, conversations, or meeting summaries (also supports JSON content)
- code_snippet: Programming code with language detection
- document: Document content extracted from files such as PDF, HTML, or TXT
Text Memory Ingestion
The most basic form of memory is text. You can add text memories using the /v1/memory endpoint.
curl -X POST https://memory.papr.ai/v1/memory \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Client-Type: curl" \
  -d '{
    "content": "The product team discussed the new feature roadmap for Q3, focusing on user analytics and performance improvements.",
    "type": "text",
    "metadata": {
      "topics": "meeting, product, roadmap",
      "hierarchical_structures": "Company/Product/Roadmap",
      "createdAt": "2024-04-15",
      "sourceUrl": "https://meetings.example.com/123",
      "conversationId": "conv-123",
      "custom_field": "You can add any custom fields here"
    }
  }'
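The same request can be sent from Python with the requests library. This is a minimal sketch that mirrors the curl call above; the environment variable name and the X-Client-Type value are illustrative choices, not requirements of the API.

import os
import requests

# Mirrors the curl example above; the API key is read from an environment variable
response = requests.post(
    "https://memory.papr.ai/v1/memory",
    headers={
        "X-API-Key": os.environ["PAPR_MEMORY_API_KEY"],
        "Content-Type": "application/json",
        "X-Client-Type": "python-requests",  # illustrative client identifier
    },
    json={
        "content": "The product team discussed the new feature roadmap for Q3.",
        "type": "text",
        "metadata": {
            "topics": "meeting, product, roadmap",
            "hierarchical_structures": "Company/Product/Roadmap",
            "createdAt": "2024-04-15"
        }
    },
)
print(response.status_code, response.json())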
Document Processing
To process documents such as PDFs, HTML files, or text files, you'll need to:
- Extract the content from the document using a parsing tool/library
- Chunk the content into logical segments
- Use the batch memory API to add the chunks with appropriate metadata
Extracting Document Content
First, extract text from your documents using appropriate tools:
import fitz # PyMuPDF for PDF extraction
import os
def extract_text_from_pdf(pdf_path):
    # Open the PDF
    doc = fitz.open(pdf_path)
    text_by_page = []

    # Extract text from each page
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()
        text_by_page.append({
            "content": text,
            "page_number": page_num + 1,
            "total_pages": len(doc)
        })

    filename = os.path.basename(pdf_path)
    return {
        "filename": filename,
        "pages": text_by_page
    }
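For instance, you can confirm the extraction worked by inspecting the page count and a preview of the first page; the file path below is a placeholder:

doc_data = extract_text_from_pdf("reports/q3_roadmap.pdf")  # placeholder path
print(f"{doc_data['filename']}: {doc_data['pages'][0]['total_pages']} pages")
print(doc_data["pages"][0]["content"][:200])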
Chunking Document Content
Split long document content into manageable chunks:
def chunk_document_text(text, max_chunk_size=1000, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))

        # Try to find a good breaking point (e.g., end of paragraph)
        if end < len(text):
            # Look for a paragraph or sentence break
            for break_char in ['\n\n', '\n', '. ']:
                break_point = text.rfind(break_char, start, end)
                if break_point != -1 and break_point > start:
                    end = break_point + len(break_char)
                    break

        chunks.append(text[start:end])

        # Stop once the end of the text has been reached
        if end >= len(text):
            break

        # Create some overlap between chunks while always advancing
        start = max(start + 1, end - overlap)

    return chunks
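As a quick sanity check, you can run the chunker on a short string and inspect the resulting pieces; the sample text and sizes below are purely illustrative:

sample = (
    "First paragraph about the roadmap.\n\n"
    "Second paragraph about analytics.\n\n"
    "Third paragraph about performance."
)
for i, chunk in enumerate(chunk_document_text(sample, max_chunk_size=60, overlap=10)):
    print(f"Chunk {i}: {chunk!r}")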
Adding Document Chunks via Batch API
Use the batch memory API to add all document chunks efficiently:
from papr_memory import Papr
from datetime import datetime
import os
import uuid

client = Papr(
    api_key=os.environ.get("PAPR_MEMORY_API_KEY")
)

def process_document(pdf_path):
    # Extract text from PDF
    doc_data = extract_text_from_pdf(pdf_path)

    # Create a document ID
    document_id = f"doc_{uuid.uuid4().hex[:8]}"

    # Prepare memory items from each page
    memories = []
    for page in doc_data["pages"]:
        # Split page content into chunks
        chunks = chunk_document_text(page["content"])

        # Create memory items for each chunk
        for i, chunk in enumerate(chunks):
            memory_item = {
                "content": chunk,
                "type": "text",
                "metadata": {
                    "document_id": document_id,
                    "filename": doc_data["filename"],
                    "page_number": page["page_number"],
                    "total_pages": page["total_pages"],
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                    "hierarchical_structures": f"Documents/{doc_data['filename']}",
                    "createdAt": datetime.now().isoformat()
                }
            }
            memories.append(memory_item)

    # Use batch API to add all chunks
    batch_response = client.memory.add_batch(
        memories=memories,
        batch_size=10  # Process 10 items at a time
    )

    print(f"Document processed: {batch_response.total_successful}/{len(memories)} chunks successful")
    return document_id
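With the helpers above in place, ingesting a file is a single call; the path below is a placeholder:

document_id = process_document("reports/q3_roadmap.pdf")  # placeholder path
print(f"Stored document with ID: {document_id}")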
Document Content Retrieval
When you need to access document content, search with specific metadata:
def retrieve_document_content(document_id):
    search_response = client.memory.search(
        query="Retrieve all content from this document",
        metadata={
            "document_id": document_id
        },
        max_memories=50  # Increase if document has many chunks
    )

    if search_response.data and search_response.data.memories:
        # Sort by page number and chunk index
        memories = sorted(
            search_response.data.memories,
            key=lambda m: (m.metadata.get("page_number", 0), m.metadata.get("chunk_index", 0))
        )
        return memories

    return []
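Because the chunks come back sorted by page and chunk index, you can reassemble a readable view of the document. The sketch below assumes each returned memory exposes its text as a content attribute (mirroring the content field used when the chunks were added) and does not deduplicate the overlapping regions between chunks:

def reassemble_document(document_id):
    memories = retrieve_document_content(document_id)
    # Join chunk texts in page/chunk order; overlapping regions are left as-is
    return "\n".join(m.content for m in memories)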
Code Snippet Memory
Capture code snippets with language detection:
curl -X POST https://memory.papr.ai/v1/memory \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Client-Type: curl" \
  -d '{
    "content": "def calculate_total(items):\n return sum(item.price for item in items)",
    "type": "code_snippet",
    "metadata": {
      "language": "python",
      "topics": "code, pricing, utility",
      "hierarchical_structures": "Code/Python/Utils",
      "author": "Jane Smith",
      "project": "Billing System"
    }
  }'
Searching Memories
Papr Memory combines vector and graph search automatically to provide the most relevant results. You can control how many memories and graph nodes are returned.
curl -X POST https://memory.papr.ai/v1/memory/search \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept-Encoding: gzip" \
  -H "X-Client-Type: curl" \
  -d '{
    "query": "What are the key points from our recent product planning?"
  }'
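The same search can be issued through the Python SDK with an explicit cap on the number of memories returned. This sketch reuses the client from the document-processing example; max_memories mirrors the parameter used in the document-retrieval example above (see the API Reference for the full set of tuning options):

search_response = client.memory.search(
    query="What are the key points from our recent product planning?",
    max_memories=20  # Cap how many memories come back
)
if search_response.data and search_response.data.memories:
    print(f"Found {len(search_response.data.memories)} relevant memories")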
Content Size Limitations
When working with Papr Memory, it's important to be aware of content size limitations:
Memory Content Size Limits
- Text and Code Snippets: The default maximum size for individual memory content is 15,000 bytes (approximately 15 KB).
- Exceeding this limit will result in a 413 Payload Too Large error.
- If you need to store larger text, consider breaking it into smaller, logical chunks or using the document processing approach described above.
Document Processing Size Considerations
When processing documents:
- Split large documents into smaller chunks (recommended 1,000-2,000 characters per chunk)
- Ensure each chunk stays under the 15,000 byte limit (see the size-check sketch after this list)
- Use batch operations to efficiently add multiple chunks
- For extremely large files, you may need to implement pagination in your processing code
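A minimal way to enforce the limit client-side is to measure each chunk's UTF-8 byte length before uploading. The 15,000-byte threshold below matches the default limit described in this section, and the sample input is purely illustrative:

MAX_CONTENT_BYTES = 15000  # Default per-memory limit

def within_size_limit(chunk: str) -> bool:
    # The limit applies to bytes, so encode before measuring
    return len(chunk.encode("utf-8")) <= MAX_CONTENT_BYTES

chunks = chunk_document_text("A" * 40000, max_chunk_size=2000)
safe_chunks = [c for c in chunks if within_size_limit(c)]
print(f"{len(safe_chunks)}/{len(chunks)} chunks are within the size limit")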
Error Handling for Size Limits
If you attempt to upload content that exceeds the size limits, you'll receive an error response:
{
  "code": 413,
  "status": "error",
  "error": "Content size (16000 bytes) exceeds maximum limit of 15000 bytes.",
  "details": {
    "max_content_length": 15000
  }
}
For batch uploads, individual items that exceed the size limit will be reported in the errors array of the response.
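Here is a hedged sketch for surfacing those per-item failures after a batch call; it reuses the client and memories list from the document-processing example and prints each error entry as-is, since its exact shape may vary:

batch_response = client.memory.add_batch(
    memories=memories,
    batch_size=10
)

# Oversized (or otherwise rejected) items are reported per item instead of failing the whole batch
if getattr(batch_response, "errors", None):
    for error in batch_response.errors:
        print("Batch item failed:", error)

print(f"{batch_response.total_successful}/{len(memories)} items stored successfully")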
Metadata Structure
Papr Memory allows flexible metadata to help organize and retrieve your memories effectively:
Standard Metadata Fields
- topics: String of topic labels to categorize the memory (comma-separated)
- hierarchical_structures: String representing hierarchical categorization (e.g., "Department/Team/Project")
- createdAt: When the content was created or relevant
- sourceUrl: Link to the original source
- conversationId: ID of the conversation the memory belongs to
- external_user_id: External identifier for the user associated with this memory
- external_user_read_access: Array of external user IDs that have read access
- external_user_write_access: Array of external user IDs that have write access (see the example below)
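For example, the access-control fields take arrays of external user IDs; the IDs below are placeholders:

"metadata": {
  "external_user_id": "user-abc",
  "external_user_read_access": ["user-abc", "user-def"],
  "external_user_write_access": ["user-abc"]
}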
Custom Metadata Fields
You can add any custom fields to the metadata object to meet your specific needs:
"metadata": {
"topics": "meeting, product",
"hierarchical_structures": "Company/Product/Roadmap",
"createdAt": "2024-04-15T10:00:00Z",
"emoji_tags": "📊,💡,📝",
"emotion_tags": "focused, productive",
"department": "Engineering",
"project_id": "PRJ-123",
"customer_id": "CUST-456",
"is_confidential": true,
"related_ticket": "TICKET-789",
"any_custom_field": "You can add any custom fields"
}
Best Practices
- Add rich metadata to your memories to improve search and organization.
- Use topics and hierarchical structures to create a consistent knowledge organization system.
- Use batch processing for large volumes of memories to reduce API calls.
- Process documents in logical chunks with meaningful metadata to maintain relationships between chunks.
- Consider content size limits: text memories have a 15,000 byte limit by default.
- Include context when available to enhance the semantic understanding of your memories.
- Add relationships between memories using the relationships_json field to build more connected knowledge.
Troubleshooting
| Issue | Solution |
|---|---|
| 413 Payload Too Large | Break content into smaller chunks |
| Missing metadata after retrieval | Ensure metadata fields use supported formats |
| Low-quality embeddings | Provide more context or related information |
Next Steps
- Learn about Search Tuning
- Explore Batch Writes and Idempotency
- See the API Reference for detailed endpoint information