Content Ingestion

This guide explains how to ingest different types of content into Papr Memory, including text, documents, and code snippets.

Overview

Papr Memory supports various content types to build a comprehensive memory system:

  • Text-based memories (notes, conversations, JSON, etc.)
  • Documents (PDF, HTML, TXT)
  • Code snippets (with language detection)

Memory Types

Papr Memory supports the following memory types:

  • text - Plain text content like notes, conversations, or meeting summaries (also supports JSON content)
  • code_snippet - Programming code with language detection
  • document - Document content extracted from files such as PDF, HTML, or TXT

Text Memory Ingestion

The most basic form of memory is text. You can add text memories using the /v1/memory endpoint.

curl -X POST https://memory.papr.ai/v1/memory \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Client-Type: curl" \
  -d '{
    "content": "The product team discussed the new feature roadmap for Q3, focusing on user analytics and performance improvements.",
    "type": "text",
    "metadata": {
      "topics": ["meeting", "product", "roadmap"],
      "hierarchical_structures": "Company/Product/Roadmap",
      "createdAt": "2024-04-15",
      "sourceUrl": "https://meetings.example.com/123",
      "conversationId": "conv-123",
      "custom_field": "You can add any custom fields here"
    }
  }'
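
If you prefer the Python SDK over raw HTTP, the same memory can be added with client.memory.add, the method used throughout the rest of this guide; a minimal sketch:

import os
from papr_memory import Papr

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

# Mirror of the curl request above
response = client.memory.add(
    content="The product team discussed the new feature roadmap for Q3, focusing on user analytics and performance improvements.",
    type="text",
    metadata={
        "topics": ["meeting", "product", "roadmap"],
        "hierarchical_structures": "Company/Product/Roadmap",
        "createdAt": "2024-04-15",
        "sourceUrl": "https://meetings.example.com/123",
        "conversationId": "conv-123"
    }
)
print(f"Added memory with response code: {response.code}")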

Document Processing

To process documents such as PDFs, HTML files, or text files, you'll need to:

  1. Extract the content from the document using a parsing tool/library
  2. Chunk the content into logical segments
  3. Use the batch memory API to add the chunks with appropriate metadata

Extracting Document Content

First, extract text from your documents using appropriate tools:

import fitz  # PyMuPDF for PDF extraction
import os

def extract_text_from_pdf(pdf_path):
    # Open the PDF
    doc = fitz.open(pdf_path)
    text_by_page = []
    
    # Extract text from each page
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()
        text_by_page.append({
            "content": text,
            "page_number": page_num + 1,
            "total_pages": len(doc)
        })
    
    filename = os.path.basename(pdf_path)
    
    return {
        "filename": filename,
        "pages": text_by_page
    }
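
The same approach works for HTML and TXT files. The sketch below assumes BeautifulSoup for HTML parsing (any HTML-to-text library can be substituted) and returns the same structure as the PDF extractor, so the rest of the pipeline stays unchanged:

import os
from bs4 import BeautifulSoup  # assumption: any HTML parser can be swapped in

def extract_text_from_html(html_path):
    # Strip tags and keep only the readable text
    with open(html_path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return {
        "filename": os.path.basename(html_path),
        "pages": [{"content": soup.get_text(separator="\n"), "page_number": 1, "total_pages": 1}]
    }

def extract_text_from_txt(txt_path):
    # Plain-text files need no parsing
    with open(txt_path, "r", encoding="utf-8") as f:
        return {
            "filename": os.path.basename(txt_path),
            "pages": [{"content": f.read(), "page_number": 1, "total_pages": 1}]
        }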

Chunking Document Content

Split long document content into manageable chunks:

def chunk_document_text(text, max_chunk_size=1000, overlap=100):
    chunks = []
    start = 0
    
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        
        # If we're not at the end of the text, try to end the chunk
        # at a sentence or paragraph boundary rather than mid-sentence
        if end < len(text):
            # Try to find sentence or paragraph break
            last_period = text.rfind('. ', start, end)
            last_newline = text.rfind('\n', start, end)
            
            if last_period > start + max_chunk_size // 2:
                end = last_period + 1  # Include the period
            elif last_newline > start + max_chunk_size // 2:
                end = last_newline + 1  # Include the newline
        
        chunks.append(text[start:end])
        
        # Move start position, considering overlap
        start = end - overlap if end < len(text) else len(text)
    
    return chunks
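
A quick way to sanity-check the chunker before ingesting real documents (the sample text here is illustrative):

# Build a long sample string and confirm the chunk sizes look reasonable
sample_text = "Papr Memory stores long documents as overlapping chunks. " * 50
chunks = chunk_document_text(sample_text, max_chunk_size=1000, overlap=100)
print(f"Produced {len(chunks)} chunks")
print(f"Chunk lengths: {[len(c) for c in chunks]}")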

Batch Ingesting Document Chunks

After extracting and chunking document content, ingest it using batch operations:

import os
import uuid
from papr_memory import Papr
from datetime import datetime
import time

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

def ingest_document(pdf_path, client):
    # Extract text from document
    document_data = extract_text_from_pdf(pdf_path)
    filename = document_data["filename"]
    
    # Prepare memory chunks
    memories = []
    
    for page in document_data["pages"]:
        # Break page content into chunks
        chunks = chunk_document_text(page["content"])
        
        for i, chunk_text in enumerate(chunks):
            memories.append({
                "content": chunk_text,
                "type": "document",
                "metadata": {
                    "topics": ["document", filename.split(".")[0].lower().replace("_", " ")],
                    "hierarchical_structures": f"Documents/{filename}",
                    "filename": filename,
                    "page_number": page["page_number"],
                    "total_pages": page["total_pages"],
                    "chunk_number": i + 1,
                    "createdAt": datetime.now().isoformat()
                }
            })
    
    # Batch add chunks with appropriate rate limiting
    batch_size = 10
    
    # You can set up a webhook to be notified when batch processing completes
    # This is especially useful for large documents with many chunks
    webhook_url = "https://your-server.com/webhooks/document-ingest-complete"
    
    for i in range(0, len(memories), batch_size):
        batch = memories[i:i + batch_size]
        response = client.memory.add_batch(
            memories=batch,
            batch_size=batch_size,
            webhook_url=webhook_url
        )
        print(f"Processed batch {i // batch_size + 1}: {response.total_successful} successes, {response.total_failed} failures")
        
        # Simple rate limiting
        time.sleep(1)
    
    return len(memories)

# Example usage
total_chunks = ingest_document("annual_report_2023.pdf", client)
print(f"Ingested document with {total_chunks} chunks")

Code Snippet Ingestion

Code snippets require special handling to preserve formatting and capture language information:

import os
from papr_memory import Papr

client = Papr(
    x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"),
)

def add_code_snippet(client, code, language="python", filename=None):
    metadata = {
        "topics": ["code", language],
        "hierarchical_structures": f"Code/{language}"
    }
    
    if filename:
        metadata["filename"] = filename
        metadata["hierarchical_structures"] = f"Code/{language}/{filename}"
    
    response = client.memory.add(
        content=code,
        type="code_snippet",
        metadata=metadata
    )
    
    return response

# Example usage
python_code = """
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Calculate the first 10 Fibonacci numbers
for i in range(10):
    print(fibonacci(i))
"""

response = add_code_snippet(client, python_code, "python", "fibonacci.py")
print(f"Added code snippet with response code: {response.code}")

Batch Processing

For efficiently ingesting large volumes of memories, use batch operations:

import os
import uuid
from papr_memory import Papr
from datetime import datetime
import time

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

def add_memories_in_batch(client, memories_data):
    # Convert to Papr memory format
    memories = []
    
    for data in memories_data:
        memory = {
            "content": data["content"],
            "type": data["type"],
            "metadata": {
                "topics": data.get("topics", ["untagged"]),
                "hierarchical_structures": data.get("category", "Uncategorized"),
                "createdAt": data.get("date", datetime.now().isoformat())
            }
        }
        
        # Add any additional metadata
        for key, value in data.items():
            if key not in ["content", "type", "topics", "category", "date"]:
                memory["metadata"][key] = value
        
        memories.append(memory)
    
    # Split into batches of 10
    batch_size = 10
    results = {
        "total": len(memories),
        "successful": 0,
        "failed": 0
    }
    
    for i in range(0, len(memories), batch_size):
        batch = memories[i:i + batch_size]
        try:
            response = client.memory.add_batch(
                memories=batch,
                batch_size=batch_size
            )
            results["successful"] += response.total_successful
            results["failed"] += response.total_failed
            
            # Simple rate limiting
            time.sleep(1)
        except Exception as e:
            print(f"Error processing batch {i // batch_size + 1}: {str(e)}")
            results["failed"] += len(batch)
    
    return results

# Example usage
memories_data = [
    {
        "content": "Customer reported an issue with the checkout process",
        "type": "text",
        "topics": ["customer", "issue", "checkout"],
        "category": "Support/Issues/Checkout",
        "priority": "high"
    },
    {
        "content": "Team brainstorming session results for new features",
        "type": "text",
        "topics": ["team", "brainstorming", "features"],
        "category": "Product/Planning",
        "participants": ["Alice", "Bob", "Charlie"]
    }
]

results = add_memories_in_batch(client, memories_data)
print(f"Processed {results['total']} memories: {results['successful']} successful, {results['failed']} failed")

Best Practices

  1. Chunk Appropriately: Divide long content into semantically meaningful chunks (paragraphs, sections)
  2. Rich Metadata: Add comprehensive metadata to improve organization and searchability
  3. Batch Processing: Use batch operations for large volumes of data
  4. Rate Limiting: Implement appropriate rate limiting to avoid API throttling
  5. Error Handling: Implement robust error handling and retry logic (a retry sketch follows this list)
  6. Content Type: Use the appropriate content type (text, code_snippet, document) for each memory
  7. Hierarchical Organization: Create meaningful hierarchical structures for easy navigation
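
For items 4 and 5, here is a minimal sketch of retry with exponential backoff around the client.memory.add_batch call used earlier (the max_retries and base_delay values are illustrative):

import time

def add_batch_with_retry(client, batch, max_retries=3, base_delay=1.0):
    # Retry transient failures with exponential backoff; re-raise after the final attempt
    for attempt in range(max_retries):
        try:
            return client.memory.add_batch(memories=batch, batch_size=len(batch))
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Batch attempt {attempt + 1} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)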

Next Steps