Content Ingestion
This guide explains how to ingest different types of content into Papr Memory, including text, documents, and code snippets.
Overview
Papr Memory supports various content types to build a comprehensive memory system:
- Text-based memories (notes, conversations, JSON, etc.)
- Documents (PDF, HTML, TXT)
- Code snippets (with language detection)
Memory Types
Papr Memory supports the following memory types:
text
- Plain text content such as notes, conversations, or meeting summaries (also supports JSON content)
code_snippet
- Programming code with language detection
document
- Document content extracted from files such as PDF, HTML, or TXT
Text Memory Ingestion
The most basic form of memory is text. You can add text memories using the /v1/memory endpoint.
curl -X POST https://memory.papr.ai/v1/memory \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Client-Type: curl" \
  -d '{
    "content": "The product team discussed the new feature roadmap for Q3, focusing on user analytics and performance improvements.",
    "type": "text",
    "metadata": {
      "topics": ["meeting", "product", "roadmap"],
      "hierarchical_structures": "Company/Product/Roadmap",
      "createdAt": "2024-04-15",
      "sourceUrl": "https://meetings.example.com/123",
      "conversationId": "conv-123",
      "custom_field": "You can add any custom fields here"
    }
  }'
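The same memory can be added with the Python SDK. This is a minimal sketch that mirrors the client.memory.add call used in the Code Snippet Ingestion section later in this guide; the field values match the curl example above.

import os
from papr_memory import Papr

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

response = client.memory.add(
    content="The product team discussed the new feature roadmap for Q3, focusing on user analytics and performance improvements.",
    type="text",
    metadata={
        "topics": ["meeting", "product", "roadmap"],
        "hierarchical_structures": "Company/Product/Roadmap",
        "createdAt": "2024-04-15",
        "sourceUrl": "https://meetings.example.com/123",
        "conversationId": "conv-123"
    }
)
print(response)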
Document Processing
To process documents such as PDFs, HTML files, or text files, you'll need to:
- Extract the content from the document using a parsing tool/library
- Chunk the content into logical segments
- Use the batch memory API to add the chunks with appropriate metadata
Extracting Document Content
First, extract text from your documents using appropriate tools:
import fitz  # PyMuPDF for PDF extraction
import os

def extract_text_from_pdf(pdf_path):
    # Open the PDF
    doc = fitz.open(pdf_path)
    text_by_page = []

    # Extract text from each page
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()
        text_by_page.append({
            "content": text,
            "page_number": page_num + 1,
            "total_pages": len(doc)
        })

    filename = os.path.basename(pdf_path)
    return {
        "filename": filename,
        "pages": text_by_page
    }
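The same output structure works for the other formats mentioned above. Below is a minimal sketch for HTML and plain-text files; it assumes the beautifulsoup4 package is installed for HTML parsing and treats each file as a single "page" so it fits the pipeline used for PDFs.

import os
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

def extract_text_from_html(html_path):
    # Strip markup and keep only the visible text
    with open(html_path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return {
        "filename": os.path.basename(html_path),
        "pages": [{"content": soup.get_text(separator="\n"), "page_number": 1, "total_pages": 1}]
    }

def extract_text_from_txt(txt_path):
    # Plain-text files need no parsing; wrap them in the same structure
    with open(txt_path, "r", encoding="utf-8") as f:
        return {
            "filename": os.path.basename(txt_path),
            "pages": [{"content": f.read(), "page_number": 1, "total_pages": 1}]
        }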
Chunking Document Content
Split long document content into manageable chunks:
def chunk_document_text(text, max_chunk_size=1000, overlap=100):
    chunks = []
    start = 0

    while start < len(text):
        end = min(start + max_chunk_size, len(text))

        # If not at the end of the text and not at a good break point,
        # try to find a good break point (sentence ending or paragraph)
        if end < len(text):
            # Try to find sentence or paragraph break
            last_period = text.rfind('. ', start, end)
            last_newline = text.rfind('\n', start, end)

            if last_period > start + max_chunk_size // 2:
                end = last_period + 1  # Include the period
            elif last_newline > start + max_chunk_size // 2:
                end = last_newline + 1  # Include the newline

        chunks.append(text[start:end])

        # Move start position, considering overlap
        start = end - overlap if end < len(text) else len(text)

    return chunks
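A quick way to sanity-check the chunker before ingesting real documents is to run it on sample text and inspect the chunk sizes (the values below are purely illustrative):

sample_text = "The quarterly report covers revenue, costs, and hiring plans. " * 60  # roughly 3,700 characters
chunks = chunk_document_text(sample_text, max_chunk_size=1000, overlap=100)

print(f"Produced {len(chunks)} chunks")
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {len(chunk)} characters")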
Batch Ingesting Document Chunks
After extracting and chunking document content, ingest it using batch operations:
import os
from papr_memory import Papr
from datetime import datetime
import time

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

def ingest_document(pdf_path, client):
    # Extract text from document
    document_data = extract_text_from_pdf(pdf_path)
    filename = document_data["filename"]

    # Prepare memory chunks
    memories = []
    for page in document_data["pages"]:
        # Break page content into chunks
        chunks = chunk_document_text(page["content"])

        for i, chunk_text in enumerate(chunks):
            memories.append({
                "content": chunk_text,
                "type": "document",
                "metadata": {
                    "topics": ["document", filename.split(".")[0].lower().replace("_", " ")],
                    "hierarchical_structures": f"Documents/{filename}",
                    "filename": filename,
                    "page_number": page["page_number"],
                    "total_pages": page["total_pages"],
                    "chunk_number": i + 1,
                    "createdAt": datetime.now().isoformat()
                }
            })

    # Batch add chunks with appropriate rate limiting
    batch_size = 10

    # You can set up a webhook to be notified when batch processing completes
    # This is especially useful for large documents with many chunks
    webhook_url = "https://your-server.com/webhooks/document-ingest-complete"

    for i in range(0, len(memories), batch_size):
        batch = memories[i:i + batch_size]
        response = client.memory.add_batch(
            memories=batch,
            batch_size=batch_size,
            webhook_url=webhook_url
        )
        print(f"Processed batch {i // batch_size + 1}: {response.total_successful} successes, {response.total_failed} failures")

        # Simple rate limiting
        time.sleep(1)

    return len(memories)

# Example usage
total_chunks = ingest_document("annual_report_2023.pdf", client)
print(f"Ingested document with {total_chunks} chunks")
Code Snippet Ingestion
Code snippets require special handling to preserve formatting and capture language information:
import os
from papr_memory import Papr

client = Papr(
    x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"),
)

def add_code_snippet(client, code, language="python", filename=None):
    metadata = {
        "topics": ["code", language],
        "hierarchical_structures": f"Code/{language}"
    }

    if filename:
        metadata["filename"] = filename
        metadata["hierarchical_structures"] = f"Code/{language}/{filename}"

    response = client.memory.add(
        content=code,
        type="code_snippet",
        metadata=metadata
    )

    return response

# Example usage
python_code = """
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Calculate the first 10 Fibonacci numbers
for i in range(10):
    print(fibonacci(i))
"""

response = add_code_snippet(client, python_code, "python", "fibonacci.py")
print(f"Added code snippet with response code: {response.code}")
Batch Processing
For efficiently ingesting large volumes of memories, use batch operations:
import os
from papr_memory import Papr
from datetime import datetime
import time

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

def add_memories_in_batch(client, memories_data):
    # Convert to Papr memory format
    memories = []

    for data in memories_data:
        memory = {
            "content": data["content"],
            "type": data["type"],
            "metadata": {
                "topics": data.get("topics", ["untagged"]),
                "hierarchical_structures": data.get("category", "Uncategorized"),
                "createdAt": data.get("date", datetime.now().isoformat())
            }
        }

        # Add any additional metadata
        for key, value in data.items():
            if key not in ["content", "type", "topics", "category", "date"]:
                memory["metadata"][key] = value

        memories.append(memory)

    # Split into batches of 10
    batch_size = 10
    results = {
        "total": len(memories),
        "successful": 0,
        "failed": 0
    }

    for i in range(0, len(memories), batch_size):
        batch = memories[i:i + batch_size]

        try:
            response = client.memory.add_batch(
                memories=batch,
                batch_size=batch_size
            )
            results["successful"] += response.total_successful
            results["failed"] += response.total_failed

            # Simple rate limiting
            time.sleep(1)
        except Exception as e:
            print(f"Error processing batch {i // batch_size + 1}: {str(e)}")
            results["failed"] += len(batch)

    return results

# Example usage
memories_data = [
    {
        "content": "Customer reported an issue with the checkout process",
        "type": "text",
        "topics": ["customer", "issue", "checkout"],
        "category": "Support/Issues/Checkout",
        "priority": "high"
    },
    {
        "content": "Team brainstorming session results for new features",
        "type": "text",
        "topics": ["team", "brainstorming", "features"],
        "category": "Product/Planning",
        "participants": ["Alice", "Bob", "Charlie"]
    }
]

results = add_memories_in_batch(client, memories_data)
print(f"Processed {results['total']} memories: {results['successful']} successful, {results['failed']} failed")
Best Practices
- Chunk Appropriately: Divide long content into semantically meaningful chunks (paragraphs, sections)
- Rich Metadata: Add comprehensive metadata to improve organization and searchability
- Batch Processing: Use batch operations for large volumes of data
- Rate Limiting: Implement appropriate rate limiting to avoid API throttling
- Error Handling: Implement robust error handling and retry logic (see the retry sketch after this list)
- Content Type: Use the appropriate content type (text, code_snippet, document) for each memory
- Hierarchical Organization: Create meaningful hierarchical structures for easy navigation
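The error handling and rate limiting points above can be combined in a small retry wrapper around add_batch. This is a minimal sketch assuming the client and batch structures from the Batch Processing section; the backoff values are illustrative, not SDK defaults.

import time

def add_batch_with_retry(client, batch, batch_size=10, max_retries=3, base_delay=1.0):
    # Retry a single batch with exponential backoff before giving up
    for attempt in range(1, max_retries + 1):
        try:
            return client.memory.add_batch(memories=batch, batch_size=batch_size)
        except Exception as e:
            if attempt == max_retries:
                raise  # Out of retries; let the caller record the failure
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Batch attempt {attempt} failed ({e}); retrying in {delay:.1f}s")
            time.sleep(delay)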
Next Steps
- Learn about Context Handling to add rich context to your memories
- Explore Retrieval Strategies for efficiently searching your memories
- See the complete API Reference for detailed parameter information