Content Ingestion

This guide explains how to ingest different types of content into Papr Memory, including text, documents, and code snippets.

Overview

Papr Memory supports various content types to build a comprehensive memory system:

  • Text-based memories (notes, conversations, JSON, etc.)
  • Documents (PDF, HTML, TXT)
  • Code snippets (with language detection)

Memory Types

Papr Memory supports the following memory types:

  • text - Plain text content like notes, conversations, or meeting summaries (also supports JSON content)
  • code_snippet - Programming code with language detection
  • document - Document content extracted from files such as PDF, HTML, or TXT

Text Memory Ingestion

The most basic form of memory is text. You can add text memories using the /v1/memory endpoint.

curl -X POST https://memory.papr.ai/v1/memory \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Client-Type: curl" \
  -d '{
    "content": "The product team discussed the new feature roadmap for Q3, focusing on user analytics and performance improvements.",
    "type": "text",
    "metadata": {
      "topics": "meeting, product, roadmap",
      "hierarchical_structures": "Company/Product/Roadmap",
      "createdAt": "2024-04-15",
      "sourceUrl": "https://meetings.example.com/123",
      "conversationId": "conv-123",
      "custom_field": "You can add any custom fields here"
    }
  }'
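
If you prefer to make the same call from Python without the SDK, the sketch below mirrors the curl request above using the requests library (the X-Client-Type value and the example content are illustrative):

import os
import requests

response = requests.post(
    "https://memory.papr.ai/v1/memory",
    headers={
        "X-API-Key": os.environ["PAPR_MEMORY_API_KEY"],
        "Content-Type": "application/json",
        "X-Client-Type": "python-requests"  # illustrative client identifier
    },
    json={
        "content": "The product team discussed the new feature roadmap for Q3, focusing on user analytics and performance improvements.",
        "type": "text",
        "metadata": {
            "topics": "meeting, product, roadmap",
            "hierarchical_structures": "Company/Product/Roadmap"
        }
    }
)
print(response.status_code, response.json())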

Document Processing

To process documents such as PDFs, HTML files, or text files, you'll need to:

  1. Extract the content from the document using a parsing tool/library
  2. Chunk the content into logical segments
  3. Use the batch memory API to add the chunks with appropriate metadata

Extracting Document Content

First, extract text from your documents using appropriate tools:

import fitz  # PyMuPDF for PDF extraction
import os

def extract_text_from_pdf(pdf_path):
    # Open the PDF
    doc = fitz.open(pdf_path)
    text_by_page = []
    
    # Extract text from each page
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()
        text_by_page.append({
            "content": text,
            "page_number": page_num + 1,
            "total_pages": len(doc)
        })
    
    filename = os.path.basename(pdf_path)
    
    return {
        "filename": filename,
        "pages": text_by_page
    }

Chunking Document Content

Split long document content into manageable chunks:

def chunk_document_text(text, max_chunk_size=1000, overlap=100):
    chunks = []
    start = 0
    
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        
        # Try to find a good breaking point (e.g., end of paragraph)
        if end < len(text):
            # Look for paragraph or sentence break
            for break_char in ['\n\n', '\n', '. ']:
                break_point = text.rfind(break_char, start, end)
                if break_point != -1 and break_point > start:
                    end = break_point + len(break_char)
                    break
        
        chunks.append(text[start:end])
        
        # Stop once the end of the text is reached; otherwise step forward,
        # keeping some overlap between consecutive chunks while always
        # making progress so the loop cannot stall
        if end >= len(text):
            break
        start = max(start + 1, end - overlap)
        
    return chunks
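
As a quick sanity check, the chunker can be exercised on its own; this sketch simply reports how many chunks a longer string produces:

sample_text = "Papr Memory stores long documents as overlapping chunks. " * 200
sample_chunks = chunk_document_text(sample_text, max_chunk_size=1000, overlap=100)
print(f"Produced {len(sample_chunks)} chunks; first chunk has {len(sample_chunks[0])} characters")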

Adding Document Chunks via Batch API

Use the batch memory API to add all document chunks efficiently:

from papr_memory import Papr
from datetime import datetime
import os
import uuid

client = Papr(
    api_key=os.environ.get("PAPR_MEMORY_API_KEY")
)

def process_document(pdf_path):
    # Extract text from PDF
    doc_data = extract_text_from_pdf(pdf_path)
    
    # Create a document ID
    document_id = f"doc_{uuid.uuid4().hex[:8]}"
    
    # Prepare memory items from each page
    memories = []
    for page in doc_data["pages"]:
        # Split page content into chunks
        chunks = chunk_document_text(page["content"])
        
        # Create memory items for each chunk
        for i, chunk in enumerate(chunks):
            memory_item = {
                "content": chunk,
                "type": "text",
                "metadata": {
                    "document_id": document_id,
                    "filename": doc_data["filename"],
                    "page_number": page["page_number"], 
                    "total_pages": page["total_pages"],
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                    "hierarchical_structures": f"Documents/{doc_data['filename']}",
                    "createdAt": datetime.now().isoformat()
                }
            }
            memories.append(memory_item)
    
    # Use batch API to add all chunks
    batch_response = client.memory.add_batch(
        memories=memories,
        batch_size=10  # Process 10 items at a time
    )
    
    print(f"Document processed: {batch_response.total_successful}/{len(memories)} chunks successful")
    return document_id
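
Putting it together, a single call ingests the whole file (the path below is hypothetical):

document_id = process_document("reports/q3_product_roadmap.pdf")  # hypothetical path
print(f"Document stored with ID: {document_id}")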

Document Content Retrieval

When you need to access document content, search with specific metadata:

def retrieve_document_content(document_id):
    search_response = client.memory.search(
        query="Retrieve all content from this document",
        metadata={
            "document_id": document_id
        },
        max_memories=50  # Increase if document has many chunks
    )
    
    if search_response.data and search_response.data.memories:
        # Sort by page number and chunk index
        memories = sorted(
            search_response.data.memories,
            key=lambda m: (m.metadata.get("page_number", 0), m.metadata.get("chunk_index", 0))
        )
        return memories
    return []
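
Because every chunk carries its page number and chunk index, the original reading order can be restored by concatenating the sorted results. A minimal sketch, assuming each returned memory exposes its text through a content attribute (an assumption, not confirmed by this guide):

def reassemble_document(document_id):
    memories = retrieve_document_content(document_id)
    # Join chunk text back together in page/chunk order; the "content"
    # attribute name is assumed here
    return "".join(getattr(m, "content", "") for m in memories)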

Code Snippet Memory

Capture code snippets with language detection:

curl -X POST https://memory.papr.ai/v1/memory \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Client-Type: curl" \
  -d '{
    "content": "def calculate_total(items):\n    return sum(item.price for item in items)",
    "type": "code_snippet",
    "metadata": {
      "language": "python",
      "topics": "code, pricing, utility",
      "hierarchical_structures": "Code/Python/Utils",
      "author": "Jane Smith",
      "project": "Billing System"
    }
  }'

Searching Memories

Papr Memory combines vector and graph search automatically to provide the most relevant results. You can control how many memories and graph nodes are returned.

curl -X POST https://memory.papr.ai/v1/memory/search \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept-Encoding: gzip" \
  -H "X-Client-Type: curl" \
  -d '{
    "query": "What are the key points from our recent product planning?"
  }'
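
With the Python SDK client created earlier, the same search can be issued programmatically. The max_memories parameter also appears in the document-retrieval example above; this guide does not show the parameter that caps graph nodes, so it is omitted here:

search_response = client.memory.search(
    query="What are the key points from our recent product planning?",
    max_memories=20  # cap the number of memory items returned
)

if search_response.data and search_response.data.memories:
    for memory in search_response.data.memories:
        print(memory.metadata.get("topics"))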

Content Size Limitations

When working with Papr Memory, it's important to be aware of content size limitations:

Memory Content Size Limits

  • Text and Code Snippets: The default maximum size for individual memory content is 15,000 bytes (approximately 15KB).
  • Exceeding this limit will result in a 413 Payload Too Large error.
  • If you need to store larger text, consider breaking it into smaller, logical chunks or using the document processing approach described above.

Document Processing Size Considerations

When processing documents:

  • Split large documents into smaller chunks (recommended 1,000-2,000 characters per chunk)
  • Ensure each chunk stays under the 15,000 byte limit (a client-side check is sketched after this list)
  • Use batch operations to efficiently add multiple chunks
  • For extremely large files, you may need to implement pagination in your processing code
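
To catch the 413 error before calling the API, each chunk can be checked client-side. A minimal sketch, assuming the limit is measured against UTF-8 encoded bytes:

def exceeds_size_limit(chunk, max_bytes=15000):
    # Assumes the 15,000 byte default limit applies to UTF-8 encoded content
    return len(chunk.encode("utf-8")) > max_bytes

# "chunks" comes from chunk_document_text above
oversized = [i for i, chunk in enumerate(chunks) if exceeds_size_limit(chunk)]
if oversized:
    print(f"{len(oversized)} chunks exceed the limit and should be re-split before upload")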

Error Handling for Size Limits

If you attempt to upload content that exceeds the size limits, you'll receive an error response:

{
  "code": 413,
  "status": "error",
  "error": "Content size (16000 bytes) exceeds maximum limit of 15000 bytes.",
  "details": {
    "max_content_length": 15000
  }
}

For batch uploads, individual items that exceed the size limit will be reported in the errors array of the response.
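
A sketch of inspecting those per-item failures after a batch call; the errors attribute follows the errors array described above, but the exact shape of each entry is an assumption:

batch_response = client.memory.add_batch(memories=memories, batch_size=10)

# The errors list mirrors the "errors array" mentioned above; entry structure is assumed
for error in getattr(batch_response, "errors", None) or []:
    print("Item failed:", error)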

Metadata Structure

Papr Memory allows flexible metadata to help organize and retrieve your memories effectively:

Standard Metadata Fields

  • topics: String of topic labels to categorize the memory (comma-separated)
  • hierarchical_structures: String representing hierarchical categorization (e.g., "Department/Team/Project")
  • createdAt: When the content was created or relevant
  • sourceUrl: Link to the original source
  • conversationId: ID of the conversation the memory belongs to
  • external_user_id: External identifier for the user associated with this memory (used together with the access fields in the sketch after this list)
  • external_user_read_access: Array of external user IDs that have read access
  • external_user_write_access: Array of external user IDs that have write access
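
The access-control fields above can be combined with any other metadata when adding memories. A minimal sketch using the batch endpoint shown earlier (the user IDs are illustrative):

memory_item = {
    "content": "Quarterly revenue summary shared with the finance team.",
    "type": "text",
    "metadata": {
        "topics": "finance, reporting",
        "external_user_id": "user-42",  # illustrative IDs
        "external_user_read_access": ["user-42", "user-77"],
        "external_user_write_access": ["user-42"]
    }
}

client.memory.add_batch(memories=[memory_item], batch_size=10)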

Custom Metadata Fields

You can add any custom fields to the metadata object to meet your specific needs:

"metadata": {
  "topics": "meeting, product",
  "hierarchical_structures": "Company/Product/Roadmap",
  "createdAt": "2024-04-15T10:00:00Z",
  "emoji_tags": "📊,💡,📝",
  "emotion_tags": "focused, productive",
  "department": "Engineering",
  "project_id": "PRJ-123",
  "customer_id": "CUST-456",
  "is_confidential": true,
  "related_ticket": "TICKET-789",
  "any_custom_field": "You can add any custom fields"
}

Best Practices

  1. Add rich metadata to your memories to improve search and organization.

  2. Use topics and hierarchical structures to create a consistent knowledge organization system.

  3. Use batch processing for large volumes of memories to reduce API calls.

  4. Process documents in logical chunks with meaningful metadata to maintain relationships between chunks.

  5. Consider content size limits - text memories have a 15,000 byte limit by default.

  6. Include context when available to enhance the semantic understanding of your memories.

  7. Add relationships between memories using the relationships_json field to build more connected knowledge.

Troubleshooting

Issue                              Solution
413 Payload Too Large              Break content into smaller chunks
Missing metadata after retrieval   Ensure metadata fields use supported formats
Low-quality embeddings             Provide more context or related information

Next Steps