Osmosis Analysis - Updated Summary

Date: February 16, 2026
Status: Analysis updated to reflect Papr's actual capabilities


Key Corrections Made

The initial analysis underestimated how much Papr handles automatically. Here are the corrections:


1. Conflicting Statements - Much More Automatic

Initial Assessment (Incorrect)

  • ❌ Claimed: "No automatic conflict detection"
  • ❌ Claimed: "Developer must build conflict detection service"
  • ❌ Implied: Complex background services needed

Corrected Assessment (Accurate)

Papr handles automatically:

  • Deduplication: Define unique_identifiers: ["subject", "predicate", "object"] in schema
  • Multi-source tracking: Same claim from 3 documents → ONE node with 3 EXTRACTED_FROM relationships
  • Source counting: GraphQL automatically traverses relationships
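To make the merge behavior concrete, here is a small local simulation of deduplication on a unique-identifier key. It is not Papr's implementation: merge_claims and the input record shape are hypothetical, chosen only to mirror the bullets above, with a plain list of document ids standing in for EXTRACTED_FROM relationships.

```python
# Hypothetical dedup key, mirroring unique_identifiers in the schema.
UNIQUE_IDENTIFIERS = ("subject", "predicate", "object")


def merge_claims(extractions):
    """Merge extracted claims that share the same unique-identifier values.

    Each input record is a dict with the claim fields plus a 'document_id'.
    Returns one node per distinct key, each with a list of source documents.
    """
    nodes = {}
    for record in extractions:
        key = tuple(record[field] for field in UNIQUE_IDENTIFIERS)
        if key not in nodes:
            nodes[key] = {**dict(zip(UNIQUE_IDENTIFIERS, key)), "sources": []}
        # Record each source document once per node.
        if record["document_id"] not in nodes[key]["sources"]:
            nodes[key]["sources"].append(record["document_id"])
    return list(nodes.values())


# Made-up sample data: the same claim extracted from three documents.
extractions = [
    {"subject": "API rate limit", "predicate": "is", "object": "1000/hour",
     "document_id": "spec_v2.pdf"},
    {"subject": "API rate limit", "predicate": "is", "object": "1000/hour",
     "document_id": "blog_post.md"},
    {"subject": "API rate limit", "predicate": "is", "object": "1000/hour",
     "document_id": "email_thread.txt"},
]
merged = merge_claims(extractions)
print(len(merged), len(merged[0]["sources"]))
```

Three extractions collapse into one node with three sources, which is the shape the later queries rely on.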

Conflict detection is just a query:

query = """
query FindConflicts($subject: String!) {
  claims(where: { subject: $subject }) {
    predicate
    object  # Different values for the same predicate = conflict
    sources { document_id, version, authority }  # One entry per source document
  }
}
"""

# Returns all claims for the subject, already deduplicated by Papr
# If one predicate has multiple different 'object' values → conflict detected

Developer only adds:

  • Resolution rules (count > 3? most recent? official sources?)
  • Optional: ConflictSet nodes for workflow tracking

Complexity reduction: From "complex service" to "simple query + resolution logic"
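The "different values = conflict" check can be sketched in a few lines of client-side Python. This assumes each claim in the response carries subject, predicate, and object fields; find_conflicts and the sample data are hypothetical. Grouping by predicate as well as subject avoids flagging unrelated statements about the same subject as conflicts.

```python
from collections import defaultdict


def find_conflicts(claims):
    """Group claims by (subject, predicate); report keys with more than one object."""
    objects_by_key = defaultdict(set)
    for claim in claims:
        key = (claim["subject"], claim["predicate"])
        objects_by_key[key].add(claim["object"])
    return {
        key: sorted(values)
        for key, values in objects_by_key.items()
        if len(values) > 1
    }


# Made-up response data in the shape assumed above.
claims = [
    {"subject": "API rate limit", "predicate": "is", "object": "1000/hour"},
    {"subject": "API rate limit", "predicate": "is", "object": "500/hour"},
    {"subject": "API rate limit", "predicate": "applies to", "object": "all endpoints"},
]
conflicts = find_conflicts(claims)
print(conflicts)
```

Only the two "is" claims are reported; the "applies to" claim has a single value and is left alone.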


2. Contextual Metadata vs Claims - Clearer Mechanisms

Initial Assessment (Vague)

  • ⚠️ Said: "Distinction possible but must be explicitly modeled"
  • ⚠️ Unclear about mechanisms

Corrected Assessment (Specific)

Two clear mechanisms:

1. Memory metadata (about the source):

metadata={
    "version": "v2.0",
    "authority": "official",
    "document_type": "specification"
}

2. Property overrides (injected onto nodes):

memory_policy={
    "node_constraints": [{
        "node_type": "Claim",
        "set": {
            "version": "v2.0",  # Forced onto node
            "authority": "official"  # Forced onto node
            # LLM still extracts: subject, predicate, object
        }
    }]
}

Key insight: Source metadata, LLM-extracted claim fields, and injected node properties are three distinct, well-defined mechanisms.
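Since both mechanisms usually carry the same version and authority values, it can help to derive them from one place. The helper below is a sketch only: build_ingest_options is a hypothetical name, and how the resulting dictionaries are passed to the Papr client is not shown here.

```python
def build_ingest_options(version, authority, document_type):
    """Return source-level metadata plus a Claim property override for one document."""
    metadata = {
        "version": version,
        "authority": authority,
        "document_type": document_type,
    }
    memory_policy = {
        "node_constraints": [{
            "node_type": "Claim",
            # Injected values; subject, predicate, object are still extracted by the LLM.
            "set": {"version": version, "authority": authority},
        }]
    }
    return {"metadata": metadata, "memory_policy": memory_policy}


options = build_ingest_options("v2.0", "official", "specification")
print(options["memory_policy"]["node_constraints"][0]["set"])
```

Keeping the two dictionaries in sync this way means a document's metadata and the properties forced onto its Claim nodes cannot drift apart.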


3. Versioned Knowledge - Simpler Than Described

Initial Assessment (Overcomplicated)

  • ⚠️ Emphasized: "Must model versions as separate nodes"
  • ⚠️ Made it seem complex

Corrected Assessment (Two Simple Options)

Option 1: Version as property (simplest):

memory_policy={
    "node_constraints": [{
        "node_type": "Claim",
        "set": {
            "version": "v2.0",
            "effective_from": "2025-01-01",
            "effective_until": None  # Current
        }
    }]
}

# Query by version (fragment; GraphQL string values use double quotes)
query = 'claims(where: { subject: "...", version: "v2.0" })'

Option 2: Separate version nodes (for complex version chains):

# Create KnowledgeVersion nodes
# Link with SUPERSEDES relationships
# Query version history

Key insight: Property injection via node_constraints makes version tracking trivial. Separate nodes only needed for complex version genealogy.
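With effective_from and effective_until stored as properties, "which claim applied on a given date?" becomes a client-side filter. The sketch below assumes ISO date strings, treats the window as half-open (effective_from inclusive, effective_until exclusive), and reads None as "still current"; claims_effective_on and the sample values are hypothetical.

```python
from datetime import date


def claims_effective_on(claims, on_date):
    """Return claims whose [effective_from, effective_until) window contains on_date."""
    selected = []
    for claim in claims:
        start = date.fromisoformat(claim["effective_from"])
        end = claim["effective_until"]  # None means the claim is still current
        if start <= on_date and (end is None or on_date < date.fromisoformat(end)):
            selected.append(claim)
    return selected


# Made-up sample data: two versions of the same claim.
versioned_claims = [
    {"object": "500/hour", "version": "v1.0",
     "effective_from": "2024-01-01", "effective_until": "2025-01-01"},
    {"object": "1000/hour", "version": "v2.0",
     "effective_from": "2025-01-01", "effective_until": None},
]
past = claims_effective_on(versioned_claims, date(2024, 6, 1))
current = claims_effective_on(versioned_claims, date(2025, 3, 1))
print(past[0]["version"], current[0]["version"])
```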


4. Multi-Source Corroboration - Built-In, Not Custom

Initial Assessment (Underestimated)

  • ❌ Claimed: "No automatic source counting"
  • ❌ Claimed: "Developer must build corroboration tracking"
  • ❌ Suggested: Background service to update scores

Corrected Assessment (Much Simpler)

Papr handles automatically:

  • Deduplication merges same claim → ONE node
  • Each source gets EXTRACTED_FROM relationship
  • GraphQL traverses relationships and returns all sources

Corroboration is just counting:

query = """
query GetCorroboration($subject: String!) {
  claims(where: { subject: $subject }) {
    object
    sources {  # Papr populates this automatically
      document_id
      version
      authority
      date
    }
  }
}
"""

# In your code:
for claim in result['claims']:
    source_count = len(claim['sources'])
    official_count = sum(1 for s in claim['sources'] if s['authority'] == 'official')
    
    # Apply resolution rule
    if source_count >= 3 and official_count >= 2:
        confidence = "high"

Optional: Cache score on node:

# Only if you want to avoid recounting
memory_policy={
    "node_constraints": [{
        "node_type": "Claim",
        "set": {"source_count": 3, "corroboration_score": 0.6}
    }]
}

Key insight: Counting sources is a simple len() call on the GraphQL result. No background service needed.
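The counting rule can be wrapped in a small function so every claim gets a value, not only the ones that pass the threshold. This is a sketch under assumptions: corroboration_summary is a hypothetical helper, the sample sources are made up, and "unrated" is a placeholder label because only the "high" case is defined in this analysis.

```python
def corroboration_summary(claim, min_sources=3, min_official=2):
    """Count sources for one claim and apply the threshold rule from the text."""
    sources = claim["sources"]
    source_count = len(sources)
    official_count = sum(1 for s in sources if s.get("authority") == "official")
    is_high = source_count >= min_sources and official_count >= min_official
    return {
        "object": claim["object"],
        "source_count": source_count,
        "official_count": official_count,
        # Placeholder label for everything below the "high" threshold.
        "confidence": "high" if is_high else "unrated",
    }


# Made-up response data shaped like the GetCorroboration result.
sample_result = {"claims": [
    {"object": "1000/hour", "sources": [
        {"document_id": "spec_v2.pdf", "authority": "official"},
        {"document_id": "api_reference.pdf", "authority": "official"},
        {"document_id": "blog_post.md", "authority": "community"},
    ]},
    {"object": "500/hour", "sources": [
        {"document_id": "email_thread.txt", "authority": "informal"},
    ]},
]}
summaries = [corroboration_summary(c) for c in sample_result["claims"]]
for summary in summaries:
    print(summary)
```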


Updated Architecture Assessment

What Papr Handles (Revised Up to 90%)

Automatic (No Code Needed):

  • ✅ Knowledge graph storage
  • ✅ Deduplication (same claim from multiple sources)
  • ✅ Multi-source tracking (EXTRACTED_FROM relationships)
  • ✅ Source counting (GraphQL traversal)
  • ✅ Entity resolution
  • ✅ Provenance (automatic)

Developer Controls (via Config):

  • ✅ Schema design (what properties claims have)
  • ✅ unique_identifiers (what makes claims identical)
  • ✅ Property injection (version, authority, etc.)
  • ✅ GraphQL queries (conflict detection, analysis)

What Developer Builds (Revised Down to 10%)

Simple Logic:

  • Resolution rules (which claim wins in conflict?)
  • Inference engine (if A→B and B→C, then A→C)
  • Optional: Workflow tracking (ConflictSet nodes, review status)
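The inference rule in the list above (if A→B and B→C, then A→C) can be sketched as a fixed-point loop over edge pairs. infer_transitive is a hypothetical helper, and the rule is only sound for predicates that are genuinely transitive, so in practice it would be applied per predicate.

```python
def infer_transitive(edges):
    """Return the pairs implied by transitivity that are not already in edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a→b and b→d imply a→d; skip self-loops and known pairs.
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure - set(edges)


edges = {("A", "B"), ("B", "C"), ("C", "D")}
inferred = infer_transitive(edges)
print(sorted(inferred))
```

The loop repeats until no new pair is added, so chains longer than two hops (A→D here) are picked up as well.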

No longer needed:

  • ❌ Conflict detection service (just a query)
  • ❌ Source counting service (GraphQL does it)
  • ❌ Deduplication logic (Papr handles it)
  • ❌ Complex background jobs (most things are queries)

Revised Complexity Assessment

Initial Estimate

  • Papr: 70% of infrastructure
  • Developer: 30% custom services

Corrected Estimate

  • Papr: 90% of infrastructure (+ automatic dedup, source counting, conflict identification)
  • Developer: 10% domain logic (resolution rules, inference, optional workflow)

Impact on Timeline

Initial Estimate

  • Phase 1 (POC): 2 weeks
  • Phase 2 (Services): 4 weeks
  • Phase 3 (Production): 4 weeks
  • Total: 10 weeks

Revised Estimate

  • Phase 1 (POC): 1 week (simpler than expected)
  • Phase 2 (Resolution): 2 weeks (just rules, no services)
  • Phase 3 (Production): 2 weeks (less to harden)
  • Total: 5 weeks ← 50% reduction

Why? Because conflict detection, source counting, and corroboration are built-in queries, not custom services.


Updated Recommendation

Strength of Recommendation: Even Stronger

Before: "Yes, use Papr - saves 6-12 months"

Now: "Absolutely yes - saves 6-12 months AND the custom logic is trivial"

Reasoning:

  1. Deduplication is automatic - Define unique_identifiers in schema, done
  2. Source counting is automatic - GraphQL returns sources, just count them
  3. Conflict detection is automatic - Query by subject, check if multiple values
  4. Resolution is simple logic - Just max() with your scoring function
  5. Version tracking is property injection - Add version via node_constraints

What seemed complex (background services) is actually simple (GraphQL queries + basic logic).


Key Messages for Developer

Message 1: Deduplication Handles Most of It

"When you define unique_identifiers on your Claim node type, Papr automatically:
- Merges same claim from multiple documents into ONE node
- Creates EXTRACTED_FROM relationship to each source
- Makes source counting a simple GraphQL query

Conflict detection becomes: 'Are there multiple Claims with different objects for the same subject and predicate?'"

Message 2: Property Injection is Powerful

"Use node_constraints.set to inject metadata onto extracted nodes:
- version: "v2.0"
- authority: "official"
- extraction_date: "2026-01-15"

No need for separate metadata nodes in most cases."

Message 3: Resolution is Just Logic

"Conflict resolution is:

winner = max(claims, key=lambda c: 
    len(c['sources']) * weight_count +
    official_count(c['sources']) * weight_authority +
    recency(c['sources']) * weight_freshness
)

That's it. No complex service needed."
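The scoring expression in Message 3 leaves its helpers undefined. A runnable sketch follows, with several assumptions: the weights 2 and 5 are borrowed from the Step 4 example, while the freshness weight, the recency formula, and the name weighted_score are illustrative choices. Each source is assumed to carry an ISO date string, and today is passed in explicitly so results are reproducible.

```python
from datetime import date

WEIGHT_COUNT = 2.0       # per source (from the Step 4 example)
WEIGHT_AUTHORITY = 5.0   # per official source (from the Step 4 example)
WEIGHT_FRESHNESS = 1.0   # assumed value


def official_count(sources):
    """Number of sources labeled 'official'."""
    return sum(1 for s in sources if s["authority"] == "official")


def recency(sources, today):
    """Score in (0, 1]: 1.0 if the newest source is dated today, decaying with age in days."""
    newest = max(date.fromisoformat(s["date"]) for s in sources)
    age_days = max((today - newest).days, 0)
    return 1.0 / (1.0 + age_days)


def weighted_score(claim, today):
    """Weighted sum of source count, official-source count, and freshness."""
    sources = claim["sources"]
    return (len(sources) * WEIGHT_COUNT
            + official_count(sources) * WEIGHT_AUTHORITY
            + recency(sources, today) * WEIGHT_FRESHNESS)


# Made-up sample data: two conflicting claims.
conflicting = [
    {"object": "1000/hour", "sources": [
        {"authority": "official", "date": "2025-06-01"},
        {"authority": "community", "date": "2025-05-01"},
    ]},
    {"object": "500/hour", "sources": [
        {"authority": "community", "date": "2025-06-10"},
    ]},
]
today = date(2025, 6, 11)
winner = max(conflicting, key=lambda c: weighted_score(c, today))
print(winner["object"])
```

With these weights a single fresh community source cannot outrank an older claim backed by an official document; tuning the weights changes that trade-off.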

Example: Complete Workflow (Simplified)

Step 1: Define Schema (One Time)

schema = client.schemas.create(
    name="Osmosis",
    node_types={
        "Claim": {
            "properties": {
                "subject": {"type": "string"},
                "predicate": {"type": "string"},
                "object": {"type": "string"},
                "version": {"type": "string"},
                "authority": {"type": "string"}
            },
            "unique_identifiers": ["subject", "predicate", "object"]  # ← Dedup key
        }
    }
)

Step 2: Upload Documents (Automatic)

# Upload 3 documents mentioning "API rate limit is 1000/hour"
for doc in ["spec_v2.pdf", "blog_post.md", "email_thread.txt"]:
    with open(doc, "rb") as f:  # Close the file handle after each upload
        client.document.upload(
            file=f,
            schema_id=schema.id,
            metadata={"authority": get_authority(doc)}
        )

# Papr automatically:
# - Extracts claim: {subject: "API rate limit", predicate: "is", object: "1000/hour"}
# - Deduplicates to ONE Claim node
# - Creates 3 EXTRACTED_FROM relationships
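The upload loop calls get_authority(doc), which is not defined anywhere in this summary. One possible shape, purely as an assumption, is a lookup keyed on file extension; the mapping and every label other than "official" are hypothetical and would be replaced by whatever rule fits the real document set.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an authority label.
AUTHORITY_BY_SUFFIX = {
    ".pdf": "official",
    ".md": "community",
    ".txt": "informal",
}


def get_authority(doc_path, default="unknown"):
    """Return an authority label for a document based on its file extension."""
    return AUTHORITY_BY_SUFFIX.get(Path(doc_path).suffix.lower(), default)


for name in ["spec_v2.pdf", "blog_post.md", "email_thread.txt"]:
    print(name, get_authority(name))
```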

Step 3: Query for Conflicts (Simple)

query = """
query FindConflicts($subject: String!) {
  claims(where: { subject: $subject }) {
    object
    sources { document_id, authority }
  }
}
"""

result = await client.graphql.query(query, {"subject": "API rate limit"})

# Check for different values
values = set(c['object'] for c in result['claims'])
if len(values) > 1:
    print(f"CONFLICT: {values}")

Step 4: Resolve (Simple Logic)

# Apply resolution rule: 2 points per source, plus 5 per official source
def resolution_score(claim):
    sources = claim['sources']
    official = sum(1 for s in sources if s['authority'] == 'official')
    return len(sources) * 2 + official * 5

for claim in result['claims']:
    print(f"{claim['object']}: score={resolution_score(claim)}")

winner = max(result['claims'], key=resolution_score)
print(f"Winner: {winner['object']}")

Total complexity: ~50 lines of code. No background services. No complex workflows.


Bottom Line

The initial analysis was conservative about what Papr provides. The reality is:

Papr handles 90% automatically:

  • Deduplication via unique_identifiers
  • Multi-source tracking via relationships
  • Source counting via GraphQL
  • Property injection via node_constraints

Developer adds 10% as simple logic:

  • Resolution rules (scoring function)
  • Optional: Workflow tracking
  • Optional: Inference rules

Timeline reduced from 10 weeks to 5 weeks.

The case for using Papr is even stronger than initially assessed.


Files Updated

  1. OSMOSIS-USE-CASE-ANALYSIS.md - Full technical analysis
  2. OSMOSIS-EMAIL-RESPONSE.md - Email to developer
  3. OSMOSIS-WHY-IT-WORKS.md - Deep dive on "why schema works"
  4. OSMOSIS-UPDATED-SUMMARY.md - This document (corrections)

All documents now accurately reflect Papr's automatic capabilities.