Custom Knowledge Graph Schemas

Define your domain ontology to guide how Papr analyzes content and extracts entities.

Overview

Custom schemas allow you to define the structure of your knowledge graph by specifying:

  • What types of entities exist in your domain (node types)
  • What properties those entities have
  • How entities relate to each other (relationship types)
  • Validation rules and constraints

When you upload documents or add memories, the system uses your schema to guide entity extraction and ensure consistent graph structure.

Why Custom Schemas

Domain-Specific Extraction

Guide the system to extract entities specific to your domain:

  • Legal: Contracts, Parties, Clauses, Obligations, Deadlines
  • Medical: Patients, Diagnoses, Treatments, Medications, Procedures
  • Code: Functions, Classes, Variables, Dependencies, Bugs
  • E-commerce: Products, Customers, Orders, Payments, Reviews
  • CRM: Companies, Contacts, Opportunities, Interactions

Consistent Property Definitions

Ensure all entities of the same type have consistent properties across your entire knowledge graph.

Control Entity Resolution

Choose between semantic similarity matching (for open-ended values) or exact matching (for controlled vocabularies) when deduplicating entities.

Automatic Indexing

Required properties are automatically indexed in Neo4j for fast query performance.

How Schemas Work

  1. You define: Create a schema specifying node types, properties, and relationships
  2. You upload/add: Upload documents or add memories to the system
  3. System analyzes: Content is analyzed to identify relevant information
  4. Schema guides: Your schema determines which entities and relationships are extracted
  5. Predictive models build: The knowledge graph is built following your structure
  6. Schema ensures consistency: The same structure is applied across all content

Schema Components

Node Types

Node types define entities in your domain. Each node type has:

  • name: Unique identifier (must match pattern ^[A-Za-z][A-Za-z0-9_]*$)
  • label: Display label for the node type
  • description: Optional description for documentation
  • properties: Object defining all properties for this node type
  • required_properties: List of properties that must be present
  • unique_identifiers: Properties used for entity deduplication
  • color: Optional color for visualization (hex code)
  • icon: Optional icon name for visualization

Properties

Properties define attributes of nodes and relationships:

  • type: Data type (string, integer, float, boolean, datetime, array, object)
  • required: Whether the property must be present (boolean)
  • default: Default value if not provided
  • description: LLM-friendly description guiding extraction
  • min_length/max_length: For strings
  • min_value/max_value: For numbers
  • enum_values: List of allowed values (max 10)
  • pattern: Regex pattern for validation
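
The constraint fields above can be read as a per-value check. The following is a minimal sketch of how such rules might be applied locally before data is sent; check_value is a hypothetical helper, not part of the Papr SDK, and it assumes that pattern must match the whole string. The server's actual validation may differ.

```python
import re


def check_value(value, prop):
    """Return a list of constraint violations for one value (empty means valid).

    Hypothetical local helper, not part of the Papr SDK.
    """
    errors = []
    if isinstance(value, str):
        if "min_length" in prop and len(value) < prop["min_length"]:
            errors.append(f"shorter than min_length={prop['min_length']}")
        if "max_length" in prop and len(value) > prop["max_length"]:
            errors.append(f"longer than max_length={prop['max_length']}")
        # Assumption: the pattern must match the whole string.
        if "pattern" in prop and not re.fullmatch(prop["pattern"], value):
            errors.append(f"does not match pattern {prop['pattern']!r}")
    # bool is a subclass of int in Python, so exclude it from numeric checks.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "min_value" in prop and value < prop["min_value"]:
            errors.append(f"below min_value={prop['min_value']}")
        if "max_value" in prop and value > prop["max_value"]:
            errors.append(f"above max_value={prop['max_value']}")
    if "enum_values" in prop and value not in prop["enum_values"]:
        errors.append(f"not one of {prop['enum_values']}")
    return errors


rating = {"type": "integer", "min_value": 1, "max_value": 5}
condition = {"type": "string", "enum_values": ["new", "like_new", "good", "fair", "poor"]}

print(check_value(4, rating))          # []
print(check_value(7, rating))          # ['above max_value=5']
print(check_value("used", condition))  # one enum violation
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem with a value at once.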

Relationship Types

Relationship types define how entities connect:

  • name: Unique identifier (must match pattern ^[A-Z][A-Z0-9_]*$)
  • label: Display label
  • description: Optional description
  • allowed_source_types: List of node types that can be the source
  • allowed_target_types: List of node types that can be the target
  • properties: Optional properties for the relationship
  • cardinality: one-to-one, one-to-many, or many-to-many (default)
  • color: Optional color for visualization
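
Because allowed_source_types and allowed_target_types refer to node types by name, a typo or a missing definition is easy to overlook. The sketch below lists endpoint names that are not defined in the same schema. undefined_endpoints is a hypothetical helper, and a flagged name is not necessarily an error if it refers to a type provided elsewhere (for example by a system schema).

```python
def undefined_endpoints(node_types, relationship_types):
    """Map each relationship name to endpoint types missing from node_types."""
    missing = {}
    for rel_name, rel in relationship_types.items():
        endpoints = rel.get("allowed_source_types", []) + rel.get("allowed_target_types", [])
        unknown = sorted({t for t in endpoints if t not in node_types})
        if unknown:
            missing[rel_name] = unknown
    return missing


node_types = {"Product": {}, "Customer": {}}
relationship_types = {
    "PURCHASED": {"allowed_source_types": ["Customer"], "allowed_target_types": ["Product"]},
    "REVIEW_OF": {"allowed_source_types": ["Review"], "allowed_target_types": ["Product"]},
}

print(undefined_endpoints(node_types, relationship_types))
# {'REVIEW_OF': ['Review']}
```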

Complete E-commerce Example

Here's a complete schema for an e-commerce domain:

from papr_memory import Papr
import os

client = Papr(x_api_key=os.environ.get("PAPR_MEMORY_API_KEY"))

schema = client.schemas.create(
    name="E-commerce Schema",
    description="Product catalog and customer relationships for e-commerce operations",
    version="1.0.0",
    node_types={
        "Product": {
            "name": "Product",
            "label": "Product",
            "description": "E-commerce product with pricing and inventory",
            "properties": {
                "name": {
                    "type": "string",
                    "required": True,
                    "description": "Product name, typically 2-4 words like 'iPhone 15 Pro' or 'Nike Running Shoes'"
                },
                "price": {
                    "type": "float",
                    "required": True,
                    "description": "Price in USD as decimal number (e.g., 999.99, 29.95)",
                    "min_value": 0
                },
                "category": {
                    "type": "string",
                    "required": True,
                    "description": "Main product category - choose the most appropriate category for this item",
                    "enum_values": ["electronics", "clothing", "books", "home", "sports"]
                },
                "condition": {
                    "type": "string",
                    "required": False,
                    "description": "Physical condition of the product - use 'new' for brand new items, 'like_new' for barely used",
                    "enum_values": ["new", "like_new", "good", "fair", "poor"],
                    "default": "new"
                },
                "in_stock": {
                    "type": "boolean",
                    "required": True,
                    "description": "Availability status - true if currently available for purchase, false if out of stock"
                },
                "sku": {
                    "type": "string",
                    "required": True,
                    "description": "Stock keeping unit - exact alphanumeric code for inventory tracking",
                    "enum_values": ["SKU-001", "SKU-002", "SKU-003", "SKU-004", "SKU-005"]
                },
                "description": {
                    "type": "string",
                    "required": False,
                    "description": "Detailed product description",
                    "max_length": 1000
                }
            },
            "required_properties": ["name", "price", "category", "in_stock", "sku"],
            "unique_identifiers": ["name", "sku"],  # name: semantic, sku: exact
            "color": "#e74c3c"
        },
        "Customer": {
            "name": "Customer",
            "label": "Customer",
            "description": "Customer with purchase history and loyalty tier",
            "properties": {
                "name": {
                    "type": "string",
                    "required": True,
                    "description": "Customer full name"
                },
                "email": {
                    "type": "string",
                    "required": True,
                    "description": "Customer email address for contact and identification"
                },
                "tier": {
                    "type": "string",
                    "required": False,
                    "description": "Customer loyalty tier based on purchase history",
                    "enum_values": ["bronze", "silver", "gold"],
                    "default": "bronze"
                },
                "join_date": {
                    "type": "datetime",
                    "required": False,
                    "description": "Date when customer created account"
                }
            },
            "required_properties": ["name", "email"],
            "unique_identifiers": ["email"],
            "color": "#3498db"
        },
        "Review": {
            "name": "Review",
            "label": "Review",
            "description": "Product review with rating and text",
            "properties": {
                "rating": {
                    "type": "integer",
                    "required": True,
                    "description": "Star rating from 1 to 5",
                    "min_value": 1,
                    "max_value": 5
                },
                "text": {
                    "type": "string",
                    "required": False,
                    "description": "Review text content",
                    "max_length": 2000
                },
                "verified_purchase": {
                    "type": "boolean",
                    "required": False,
                    "description": "Whether this review is from a verified purchase",
                    "default": False
                },
                "review_date": {
                    "type": "datetime",
                    "required": True,
                    "description": "Date when review was posted"
                }
            },
            "required_properties": ["rating", "review_date"],
            "unique_identifiers": [],
            "color": "#f39c12"
        }
    },
    relationship_types={
        "PURCHASED": {
            "name": "PURCHASED",
            "label": "Purchased",
            "description": "Customer purchased a product",
            "allowed_source_types": ["Customer"],
            "allowed_target_types": ["Product"],
            "properties": {
                "date": {
                    "type": "datetime",
                    "required": True,
                    "description": "Purchase date"
                },
                "amount": {
                    "type": "float",
                    "required": True,
                    "description": "Purchase amount in USD"
                },
                "quantity": {
                    "type": "integer",
                    "required": False,
                    "description": "Number of items purchased",
                    "default": 1
                }
            },
            "cardinality": "many-to-many",
            "color": "#2ecc71"
        },
        "REVIEWED": {
            "name": "REVIEWED",
            "label": "Reviewed",
            "description": "Customer wrote a review",
            "allowed_source_types": ["Customer"],
            "allowed_target_types": ["Review"],
            "cardinality": "one-to-many",
            "color": "#9b59b6"
        },
        "REVIEW_OF": {
            "name": "REVIEW_OF",
            "label": "Review Of",
            "description": "Review is about a product",
            "allowed_source_types": ["Review"],
            "allowed_target_types": ["Product"],
            "cardinality": "many-to-one",  # not among the cardinality values listed under Relationship Types; verify it is accepted
            "color": "#95a5a6"
        }
    }
)

print(f"Schema created with ID: {schema.data.id}")

TypeScript Example

import Papr from '@papr/memory';

const client = new Papr({
  xAPIKey: process.env.PAPR_MEMORY_API_KEY
});

const schema = await client.schemas.create({
  name: "E-commerce Schema",
  description: "Product catalog and customer relationships",
  version: "1.0.0",
  node_types: {
    Product: {
      name: "Product",
      label: "Product",
      properties: {
        name: {
          type: "string",
          required: true,
          description: "Product name, typically 2-4 words"
        },
        price: {
          type: "float",
          required: true,
          description: "Price in USD as decimal number"
        },
        category: {
          type: "string",
          required: true,
          enum_values: ["electronics", "clothing", "books", "home", "sports"]
        },
        in_stock: {
          type: "boolean",
          required: true
        }
      },
      required_properties: ["name", "price", "category", "in_stock"],
      unique_identifiers: ["name"]
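    },
    // Defined so that PURCHASED's allowed_source_types refers to an existing node type
    Customer: {
      name: "Customer",
      label: "Customer",
      properties: {
        email: {
          type: "string",
          required: true,
          description: "Customer email address for contact and identification"
        }
      },
      required_properties: ["email"],
      unique_identifiers: ["email"]
    }
  },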
  relationship_types: {
    PURCHASED: {
      name: "PURCHASED",
      allowed_source_types: ["Customer"],
      allowed_target_types: ["Product"]
    }
  }
});

console.log(`Schema created with ID: ${schema.data.id}`);

Key Concepts

LLM-Friendly Descriptions

Write detailed property descriptions that guide the LLM on expected formats and usage:

Good examples:

{
    "name": {
        "description": "Product name, typically 2-4 words like 'iPhone 15 Pro' or 'Nike Running Shoes'"
    },
    "price": {
        "description": "Price in USD as decimal number (e.g., 999.99, 29.95)"
    },
    "status": {
        "description": "use 'new' for brand new items, 'like_new' for barely used, 'good' for normal wear"
    }
}

Poor examples:

{
    "name": {"description": "Name"},  # Too vague
    "price": {"description": "Price"},  # No guidance on format
    "status": {"description": "Status"}  # No explanation of values
}
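
Vague descriptions are easy to miss in a large schema. As a rough illustration, the sketch below flags properties whose description is missing or only a few words long. vague_descriptions is a hypothetical helper, and the four-word threshold is an arbitrary heuristic rather than a Papr rule.

```python
def vague_descriptions(properties, min_words=4):
    """Return names of properties whose description is missing or very short."""
    flagged = []
    for name, spec in properties.items():
        words = spec.get("description", "").split()
        if len(words) < min_words:
            flagged.append(name)
    return flagged


properties = {
    "name": {"description": "Name"},
    "price": {"description": "Price in USD as decimal number (e.g., 999.99, 29.95)"},
    "status": {},
}

print(vague_descriptions(properties))  # ['name', 'status']
```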

Enum Values

Use enums to restrict property values to a predefined list (max 10 values).

When to use enums:

  • Limited, well-defined options (≤10 values)
  • Controlled vocabularies: "active/inactive", "high/medium/low"
  • Status codes, priority levels, categories
  • When you want exact matching

When to avoid enums:

  • Open-ended text fields: names, titles, descriptions
  • Large sets of options (>10): countries, cities
  • When you want semantic similarity matching
  • Dynamic or frequently changing value sets

Example:

{
    "priority": {
        "type": "string",
        "enum_values": ["low", "medium", "high", "critical"],
        "description": "Task priority level"
    },
    "status": {
        "type": "string",
        "enum_values": ["draft", "active", "completed", "archived"],
        "description": "Current status of the item"
    }
}

Entity Resolution: Semantic vs Exact Matching

Properties in unique_identifiers are used for entity deduplication:

Without enum_values (Semantic Similarity):

  • Uses semantic matching to identify similar entities
  • Merges "Apple Inc" and "Apple Inc." as the same entity
  • Merges "John Smith" and "J. Smith" if context suggests same person
  • Best for open-ended values like company names, person names

With enum_values (Exact Matching):

  • Only entities with exactly matching enum values are merged
  • "SKU-001" only matches "SKU-001", not "SKU-002"
  • Best for controlled identifiers like status codes, SKUs, categories

Example:

{
    "Product": {
        "properties": {
            "name": {
                "type": "string",
                "required": True
                # No enum_values = semantic matching
            },
            "sku": {
                "type": "string",
                "required": True,
                "enum_values": ["SKU-001", "SKU-002", "SKU-003"]
                # With enum_values = exact matching
            }
        },
        "unique_identifiers": ["name", "sku"]
        # name uses semantic similarity
        # sku uses exact matching
    }
}
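
The rule described above (enum_values present means exact matching, otherwise semantic matching) can be checked for a node type before the schema is created. This is a minimal sketch; matching_strategies is a hypothetical helper that only reads the schema dictionary and does not call the Papr API.

```python
def matching_strategies(node_type):
    """Map each unique identifier to 'exact' (has enum_values) or 'semantic'."""
    props = node_type.get("properties", {})
    return {
        ident: "exact" if props.get(ident, {}).get("enum_values") else "semantic"
        for ident in node_type.get("unique_identifiers", [])
    }


product = {
    "properties": {
        "name": {"type": "string", "required": True},
        "sku": {
            "type": "string",
            "required": True,
            "enum_values": ["SKU-001", "SKU-002", "SKU-003"],
        },
    },
    "unique_identifiers": ["name", "sku"],
}

print(matching_strategies(product))  # {'name': 'semantic', 'sku': 'exact'}
```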

Schema Lifecycle

Schemas go through a lifecycle:

  1. Draft - Schema is being developed, not used in production
  2. Active - Schema is used for memory extraction and graph generation
  3. Deprecated - Schema is marked for removal, but existing data remains
  4. Archived - Schema is no longer used, preserved for historical data

# Create schema in draft mode
schema = client.schemas.create(
    name="My Schema",
    status="draft",
    # ... rest of schema
)

# Activate when ready
client.schemas.activate(schema.data.id, activate=True)

# Later, deprecate
client.schemas.update(schema.data.id, {"status": "deprecated"})
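
If your application tracks schema status itself, the lifecycle can be modeled as a small state machine. The sketch below is an assumption-heavy illustration: it assumes statuses advance in the order listed above, that any schema that is not yet archived can be archived (as the soft delete does), and that the status strings "active" and "archived" follow the same naming as "draft" and "deprecated". SchemaStatus and can_transition are hypothetical names, and the service may allow other transitions.

```python
from enum import Enum


class SchemaStatus(Enum):
    DRAFT = "draft"
    ACTIVE = "active"          # assumed string value
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"      # assumed string value


# Assumed forward steps, following the order of the lifecycle list.
NEXT_STATUS = {
    SchemaStatus.DRAFT: SchemaStatus.ACTIVE,
    SchemaStatus.ACTIVE: SchemaStatus.DEPRECATED,
}


def can_transition(current, target):
    """Return True if moving from current to target fits the assumed lifecycle."""
    if target is SchemaStatus.ARCHIVED:
        # Assumption: archiving is possible from any status except archived.
        return current is not SchemaStatus.ARCHIVED
    return NEXT_STATUS.get(current) is target


print(can_transition(SchemaStatus.DRAFT, SchemaStatus.ACTIVE))      # True
print(can_transition(SchemaStatus.DRAFT, SchemaStatus.DEPRECATED))  # False
```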

Using Schemas with Documents

Once you've created a schema, use it when uploading documents:

# Upload document with custom schema
with open("product_catalog.pdf", "rb") as f:  # closes the file handle once the upload returns
    response = client.document.upload(
        file=f,
        schema_id=schema.data.id,
        simple_schema_mode=True,  # Recommended: system + one custom schema
        hierarchical_enabled=True
    )

The system will use your schema to guide entity and relationship extraction from the document.

Using Schemas with Memory

Use schemas when adding memories directly:

# Add memory with graph generation using schema
response = client.memory.add(
    content="Customer Jane Doe purchased iPhone 15 Pro for $999 on 2024-03-15",
    graph_generation={
        "mode": "auto",
        "auto": {
            "schema_id": schema.data.id,
            "simple_schema_mode": True
        }
    }
)

Managing Schemas

List All Schemas

schemas = client.schemas.list()
for item in schemas.data:  # `item` avoids shadowing the `schema` object created earlier
    print(f"{item.name} ({item.status})")

Get Specific Schema

schema = client.schemas.retrieve(schema_id)
print(schema.data.name)
print(schema.data.node_types)

Update Schema

updated = client.schemas.update(
    schema_id,
    {
        "description": "Updated description",
        "node_types": {
            # Add or modify node types
        }
    }
)

Delete Schema

# Soft delete (archives the schema)
client.schemas.delete(schema_id)

Activate/Deactivate

# Activate schema for use
client.schemas.activate(schema_id, activate=True)

# Deactivate schema
client.schemas.activate(schema_id, activate=False)

Best Practices

1. Start Simple, Iterate

Begin with a basic schema covering your core entities:

{
    "node_types": {
        "Customer": {},  # minimal properties
        "Product": {}    # minimal properties
    }
}

Add complexity as you understand your use case better.

2. Write Clear Descriptions

Every property should have a clear, LLM-friendly description:

{
    "contract_value": {
        "type": "float",
        "description": "Total contract value in USD, including all fees and charges. Format as decimal (e.g., 50000.00)"
    }
}

3. Use Simple Schema Mode in Production

response = client.document.upload(
    file=file,
    schema_id="your_schema",
    simple_schema_mode=True  # System + one custom schema = consistency
)

This ensures consistency between document processing and direct memory creation.

4. Limit Node Types (≤15 per schema)

Too many node types make extraction less accurate. Focus on your most important entities.

5. Limit Relationship Types (≤20 per schema)

Keep relationships meaningful and avoid over-specification.

6. Use Enums Sparingly (≤10 values)

Only use enums for truly controlled vocabularies. Open-ended fields should not have enums.

7. Mark Properties as Required Thoughtfully

Only mark properties as required if they're truly essential. Missing required properties can cause extraction failures.
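
The size guidelines in practices 4 to 6 can be checked mechanically. The following sketch counts node types, relationship types, and enum values in a schema dictionary and reports anything over the limits stated above. size_warnings is a hypothetical helper that runs locally; it is not an SDK function.

```python
MAX_NODE_TYPES = 15
MAX_RELATIONSHIP_TYPES = 20
MAX_ENUM_VALUES = 10


def size_warnings(schema):
    """Return warnings for a schema dict that exceeds the recommended limits."""
    warnings = []
    node_types = schema.get("node_types", {})
    rel_types = schema.get("relationship_types", {})
    if len(node_types) > MAX_NODE_TYPES:
        warnings.append(f"{len(node_types)} node types (limit {MAX_NODE_TYPES})")
    if len(rel_types) > MAX_RELATIONSHIP_TYPES:
        warnings.append(f"{len(rel_types)} relationship types (limit {MAX_RELATIONSHIP_TYPES})")
    for type_name, spec in node_types.items():
        for prop_name, prop in spec.get("properties", {}).items():
            count = len(prop.get("enum_values", []))
            if count > MAX_ENUM_VALUES:
                warnings.append(
                    f"{type_name}.{prop_name}: {count} enum values (limit {MAX_ENUM_VALUES})"
                )
    return warnings


schema = {
    "node_types": {
        "Customer": {
            "properties": {
                # 12 values: too many for an enum, better left as open-ended text
                "country": {"type": "string", "enum_values": [f"C{i}" for i in range(12)]},
                "tier": {"type": "string", "enum_values": ["bronze", "silver", "gold"]},
            }
        }
    },
    "relationship_types": {},
}

print(size_warnings(schema))  # ['Customer.country: 12 enum values (limit 10)']
```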

Common Domain Examples

Legal Domain

{
    "node_types": {
        "Contract": {
            "properties": {
                "title": {"type": "string", "required": True},
                "type": {
                    "type": "string",
                    "enum_values": ["service", "employment", "nda", "partnership"]
                },
                "effective_date": {"type": "datetime"},
                "expiration_date": {"type": "datetime"}
            }
        },
        "Party": {
            "properties": {
                "name": {"type": "string", "required": True},
                "role": {
                    "type": "string",
                    "enum_values": ["client", "vendor", "employee", "partner"]
                }
            }
        },
        "Obligation": {
            "properties": {
                "description": {"type": "string", "required": True},
                "deadline": {"type": "datetime"},
                "status": {
                    "type": "string",
                    "enum_values": ["pending", "completed", "overdue"]
                }
            }
        }
    }
}

Medical Domain

{
    "node_types": {
        "Patient": {
            "properties": {
                "name": {"type": "string", "required": True},
                "dob": {"type": "datetime"},
                "medical_record_number": {"type": "string"}
            }
        },
        "Diagnosis": {
            "properties": {
                "icd_code": {"type": "string"},
                "description": {"type": "string", "required": True},
                "severity": {
                    "type": "string",
                    "enum_values": ["mild", "moderate", "severe", "critical"]
                }
            }
        },
        "Treatment": {
            "properties": {
                "name": {"type": "string", "required": True},
                "start_date": {"type": "datetime"},
                "duration_days": {"type": "integer"}
            }
        }
    }
}

Code Repository Domain

{
    "node_types": {
        "Function": {
            "properties": {
                "name": {"type": "string", "required": True},
                "language": {
                    "type": "string",
                    "enum_values": ["python", "javascript", "typescript", "java"]
                },
                "description": {"type": "string"},
                "complexity": {
                    "type": "string",
                    "enum_values": ["low", "medium", "high"]
                }
            }
        },
        "Class": {
            "properties": {
                "name": {"type": "string", "required": True},
                "file_path": {"type": "string"}
            }
        },
        "Bug": {
            "properties": {
                "title": {"type": "string", "required": True},
                "severity": {
                    "type": "string",
                    "enum_values": ["low", "medium", "high", "critical"]
                },
                "status": {
                    "type": "string",
                    "enum_values": ["open", "in_progress", "resolved", "closed"]
                }
            }
        }
    }
}

Troubleshooting

Schema Validation Errors

If schema creation fails validation:

  • Check node type names match pattern ^[A-Za-z][A-Za-z0-9_]*$
  • Check relationship type names match pattern ^[A-Z][A-Z0-9_]*$
  • Verify enum_values has ≤10 items
  • Ensure required_properties reference existing properties
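
Some of these checks can be run locally before calling the API. The sketch below covers the two name patterns and the required_properties rule from the list above. preflight is a hypothetical helper, and the server may enforce additional rules that it does not cover.

```python
import re

NODE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
REL_NAME = re.compile(r"^[A-Z][A-Z0-9_]*$")


def preflight(schema):
    """Return a list of problems found in a schema dict (empty means none found)."""
    problems = []
    for name, spec in schema.get("node_types", {}).items():
        if not NODE_NAME.fullmatch(name):
            problems.append(f"invalid node type name: {name!r}")
        defined = spec.get("properties", {})
        for required in spec.get("required_properties", []):
            if required not in defined:
                problems.append(f"{name}: required property {required!r} is not defined")
    for name in schema.get("relationship_types", {}):
        if not REL_NAME.fullmatch(name):
            problems.append(f"invalid relationship type name: {name!r}")
    return problems


schema = {
    "node_types": {
        "Product": {
            "properties": {"name": {"type": "string"}},
            "required_properties": ["name", "sku"],  # "sku" is not defined above
        },
        "2ndHand": {"properties": {}},  # starts with a digit
    },
    "relationship_types": {"PURCHASED": {}, "reviewOf": {}},  # "reviewOf" is not upper case
}

for problem in preflight(schema):
    print(problem)
```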

Extraction Not Finding Entities

  • Add more detailed, LLM-friendly property descriptions
  • Verify property types match expected data
  • Check if required properties are too strict
  • Try manual graph generation mode for debugging

Too Many Duplicate Entities

  • Add more unique_identifiers
  • Use enums for controlled values that should match exactly
  • Consider property overrides in document upload

Entities Not Merging

  • Check if unique_identifiers are set correctly
  • For semantic matching, remove enum_values
  • For exact matching, add enum_values

Next Steps