Graph-aware embeddings and domain schemas
Terminology: We call this graph-aware embeddings (concept) implemented via holographic APIs (technology). When you see enable_holographic or holographic_config, that's the implementation layer. When we discuss "domain schemas" and "structured dimensions," that's the conceptual layer this guide focuses on.
The problem
Standard vector search ranks by "semantic closeness" but can't tell if two results are close for the same reason.
A code snippet about "sorting arrays" and one about "sorting linked lists" might be semantically near, but they use different algorithms, data structures, and APIs. The scientific claims "aspirin reduces heart attack risk" and "aspirin causes stomach bleeding" are both about aspirin, but one supports use and the other warns against it.
The solution
Graph-aware embeddings solve this by encoding structured domain dimensions alongside your base vector—things like programming language, operation type, evidence strength, temporal context, or custom fields you define. Search can then filter and boost by these dimensions, not just flat similarity.
At write time, Papr extracts the registered domain dimensions (topic, time, intent, entities, and other fields you define) and encodes them into fourteen frequency bands. Retrieval then becomes sensitive to topical alignment, temporal and situational context, and any other axes your frequency schema defines—beyond raw semantic recall.
Implementation: Uses Papr's holographic pipeline (enable_holographic, holographic_config, /v1/holographic/*). This guide covers concepts and schemas; API request/response shapes are in the API reference (HolographicConfig, CreateDomainRequest, CustomFrequencyField, /v1/frequencies, /v1/holographic/domains).
Example: Code search with vs without graph-aware
Query: "How do I sort a list in Python?"
Standard semantic search (without graph-aware):
Returns mixed results:
- Python list.sort() ✅
- JavaScript array.sort() ❌ (different language)
- Python sorting algorithms tutorial ⚠️ (conceptual, not code)
- SQL ORDER BY ❌ (different domain)
With graph-aware embeddings (cosqa schema):
Returns filtered results aligned on language=Python and primary_operation=sorting:
- Python list.sort() ✅
- Python sorted() function ✅
- Python custom sort with key= ✅
- Python heapq.nsmallest() for partial sorts ✅
Why: The schema encodes programming_domain, language, and primary_operation dimensions. H-COND scoring boosts results with high alignment on these fields.
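For this query, a graph-aware search configuration might look like the following sketch. The filter names come from the cosqa table in the next section; the threshold values here are illustrative assumptions, not tuned recommendations.

```python
# Hypothetical sketch of a holographic_config for the built-in cosqa schema.
# The 0.9 / 0.8 thresholds are illustrative assumptions, not recommendations.
config = {
    "enabled": True,
    "frequency_schema_id": "cosqa",
    "frequency_filters": {
        "language": 0.9,           # require strong alignment on language
        "primary_operation": 0.8,  # require alignment on the operation (sorting)
    },
}

# With the papr_memory SDK, this would be passed to search:
# results = client.memory.search(
#     query="How do I sort a list in Python?",
#     holographic_config=config,
# )
```

Raising or lowering these thresholds trades recall against precision per dimension; start loose and tighten while measuring ranking quality.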
Why it matters
| Plain similarity | Graph-aware embeddings |
|---|---|
| "Close in embedding space" | "Close and aligned on the same domain dimensions (e.g. same kind of code operation, same type of scientific claim, same topic lane)" |
| Weak on nuance (time, role, domain jargon) | Tunable via schema: what to extract and how hard each dimension should matter (weights) |
| One-size-fits-all | Built-in domains for common cases, or custom schemas for your vocabulary |
Choosing the right frequency schema—built-in or custom—is the single biggest lever for retrieval quality when you enable this mode.
Built-in schemas
Three schemas ship by default. Each defines fourteen frequency dimensions tuned to its domain. Use their ids or shortnames (for example cosqa, scifact, general) with frequency_schema_id on writes and in holographic_config on search. Inspect live definitions with GET /v1/frequencies.
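A minimal sketch for inspecting live schema definitions over HTTP, using the X-API-Key header auth shown in the registration example later in this guide. The base URL is a placeholder assumption; substitute your actual API host, and see the API reference for the response shape.

```python
from typing import Optional

def frequencies_request(base_url: str, api_key: str,
                        schema_id: Optional[str] = None):
    """Build the URL and headers for GET /v1/frequencies
    (or GET /v1/frequencies/{frequency_schema_id} when schema_id is given)."""
    path = "/v1/frequencies" + (f"/{schema_id}" if schema_id else "")
    return base_url.rstrip("/") + path, {"X-API-Key": api_key}

# Usage with any HTTP client, e.g. requests (base URL is a placeholder):
# url, headers = frequencies_request("https://YOUR_API_HOST", "YOUR_KEY", "cosqa")
# schema = requests.get(url, headers=headers).json()
```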
cosqa — Code search
Optimized for matching code snippets to natural language queries. Reported on the CosQA benchmark: +5.5% NDCG@10 vs. cross-encoder baseline.
| Hz | Dimension | Type | Weight |
|---|---|---|---|
| 0.1 | programming_domain | FreeText | 0.6 |
| 0.5 | language | Enum | 0.8 |
| 1.0 | query_intent | FreeText | 1.0 |
| 2.0 | primary_operation | FreeText | 1.0 |
| 4.0 | key_apis | MultiValue | 1.0 |
| 6.0 | data_types_used | MultiValue | 0.8 |
| 8.0 | operation_verbs | MultiValue | 0.7 |
| 10 | secondary_apis | MultiValue | 0.5 |
| 14 | return_behavior | FreeText | 0.8 |
| 18 | input_output_signature | FreeText | 0.7 |
| 22 | code_pattern | FreeText | 0.8 |
| 26 | error_handling_pattern | FreeText | 0.5 |
| 30 | design_paradigm | FreeText | 0.6 |
| 34 | algorithm_technique | FreeText | 0.7 |
scifact — Scientific claims
Optimized for matching scientific claims to evidence passages. Reported on the SciFact benchmark: +36% NDCG@10 vs. baseline.
| Hz | Dimension | Type | Weight |
|---|---|---|---|
| 0.1 | scientific_field | Enum | 0.7 |
| 0.5 | claim_type | FreeText | 0.9 |
| 1.0 | methodology | FreeText | 1.0 |
| 2.0 | evidence_strength | Enum | 0.8 |
| 4.0 | biological_entities | MultiValue | 1.0 |
| 6.0 | chemical_compounds | MultiValue | 0.8 |
| 8.0 | study_design | FreeText | 0.7 |
| 10 | outcome_measures | MultiValue | 0.9 |
| 14 | population_context | FreeText | 0.6 |
| 18 | statistical_methods | FreeText | 0.7 |
| 22 | causal_mechanism | FreeText | 0.8 |
| 26 | temporal_context | FreeText | 0.5 |
| 30 | contradiction_signals | FreeText | 0.6 |
| 34 | confidence_level | FreeText | 0.7 |
general — General purpose
Domain-agnostic schema for mixed content. Use as a starting point for evaluation before specializing.
| Hz | Dimension | Type | Weight |
|---|---|---|---|
| 0.1 | topic_category | FreeText | 0.7 |
| 0.5 | content_type | Enum | 0.6 |
| 1.0 | primary_subject | FreeText | 1.0 |
| 2.0 | intent | FreeText | 1.0 |
| 4.0 | key_entities | MultiValue | 0.9 |
| 6.0 | domain_terms | MultiValue | 0.8 |
| 8.0 | action_verbs | MultiValue | 0.7 |
| 10 | specificity | Enum | 0.6 |
| 14 | audience_level | Enum | 0.5 |
| 18 | sentiment | Enum | 0.4 |
| 22 | temporal_relevance | FreeText | 0.5 |
| 26 | geographic_context | FreeText | 0.4 |
| 30 | source_type | Enum | 0.5 |
| 34 | complexity | Enum | 0.6 |
Table types (FreeText, Enum, MultiValue) describe the dimension; the API uses lowercase types on custom fields such as free_text, enum, and multi_value_text—see below.
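Opting a memory into one of these built-in schemas happens at write time via two parameters. This small helper is a sketch (not part of the SDK) that just packages the kwargs shown in the SDK example later in this guide:

```python
# Sketch: the two write-time knobs for built-in schemas. This helper is
# illustrative, not an SDK function; it only bundles the documented kwargs.
def holographic_write_kwargs(schema_id: str = "general") -> dict:
    """Swap "general" for "cosqa" or "scifact" to use a specialized schema."""
    return {
        "enable_holographic": True,
        "frequency_schema_id": schema_id,
    }

# With the papr_memory SDK this would be:
# client.memory.add(content="...", **holographic_write_kwargs("cosqa"))
```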
About frequency bands (Hz values)
The built-in schemas show Hz values (0.1, 0.5, 2.0, etc.) representing the 14 standard brain-inspired frequency bands in holographic encoding. When registering a custom schema, each field's frequency must use one of these allowed band values—see CustomFrequencyField in the API reference for the exact enum (e.g., 0.1, 0.5, 2.0, 4.0, 6.0, 10.0, 12.0, 18.0, 19.0, 24.0, 30.0, 40.0, 50.0, 70.0).
Rule of thumb:
- Lower Hz (0.1–2.0) → Categorical/Enum dimensions (language, domain, contract type)
- Mid Hz (4.0–14) → Descriptive/FreeText dimensions (intent, operation, methodology)
- Higher Hz (18–70) → List/MultiValue dimensions (APIs, entities, key terms)
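A client-side sanity check can catch band mistakes before you call the registration endpoint. This is a sketch assuming the example band values quoted above; confirm the authoritative enum against CustomFrequencyField in the API reference before relying on it.

```python
# Example band values as quoted in this guide ("e.g." in the text) — an
# assumption; confirm against the CustomFrequencyField enum in the reference.
ALLOWED_BANDS = {0.1, 0.5, 2.0, 4.0, 6.0, 10.0, 12.0,
                 18.0, 19.0, 24.0, 30.0, 40.0, 50.0, 70.0}

def validate_fields(fields: list) -> list:
    """Return a list of problems with a candidate custom-schema field list."""
    problems = []
    if len(fields) > 14:
        problems.append("more than fourteen fields (one per frequency band)")
    seen = set()
    for f in fields:
        if f["frequency"] not in ALLOWED_BANDS:
            problems.append(f"{f['name']}: frequency {f['frequency']} is not an allowed band")
        elif f["frequency"] in seen:
            problems.append(f"{f['name']}: duplicate band {f['frequency']}")
        seen.add(f["frequency"])
    return problems

candidate = [
    {"frequency": 0.1, "name": "contract_type"},
    {"frequency": 3.0, "name": "jurisdiction"},  # 3.0 is not a band
]
print(validate_fields(candidate))
```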
Registering a custom domain schema
Define a schema for your own domain with POST /v1/holographic/domains. Custom schemas are scoped to your API key and map up to fourteen fields onto the standard frequency bands. The reference defines allowed frequency values and field types for CreateDomainRequest / CustomFrequencyField.
```http
POST /v1/holographic/domains
X-API-Key: YOUR_KEY
Content-Type: application/json

{
  "name": "acme:legal_contracts:1.0.0",
  "description": "Legal contract analysis",
  "fields": [
    { "frequency": 0.1, "name": "contract_type", "type": "enum", "values": ["MSA", "NDA", "SOW"], "weight": 0.8 },
    { "frequency": 0.5, "name": "jurisdiction", "type": "enum", "values": ["US", "EU", "UK"], "weight": 0.7 },
    { "frequency": 2.0, "name": "primary_obligation", "type": "free_text", "weight": 1.0 },
    { "frequency": 4.0, "name": "parties_involved", "type": "multi_value_text", "weight": 0.9 }
  ]
}
```

Response:

```json
{
  "status": "success",
  "schema_id": "acme:legal_contracts:1.0.0",
  "domain": "legal_contracts",
  "num_frequencies": 4
}
```

Using the registered schema:
```python
from papr_memory import Papr

client = Papr(x_api_key="YOUR_KEY")

# 1. Add memories with the custom schema
memory = client.memory.add(
    content="Signed NDA with Acme Corp, jurisdiction: US, expires 2027-01-01",
    enable_holographic=True,
    frequency_schema_id="acme:legal_contracts:1.0.0",
)

# 2. Search with domain-specific filtering
results = client.memory.search(
    query="Find all active NDAs in US jurisdiction",
    holographic_config={
        "enabled": True,
        "frequency_schema_id": "acme:legal_contracts:1.0.0",
        "frequency_filters": {
            "contract_type": 0.9,  # Must be 90%+ aligned on contract type
            "jurisdiction": 0.8,   # Must be 80%+ aligned on jurisdiction
        },
    },
)
```

Schema design principles
- Enum for categorical dimensions (language, contract type)—often on lower Hz bands.
- Free text for descriptive dimensions (intent, operation)—mid bands.
- Multi-value for lists (APIs, entities, compounds)—higher bands.
- Weight what discriminates: if language matters most, weight it 1.0; if a dimension is mostly noise, weight it down (for example 0.3).
- At most fourteen fields—one per frequency band in the holographic encoding.
Custom schema co-design with Papr is usually the fastest path to production-quality, domain-aware retrieval. The schema is what makes graph-aware search tuned rather than generic.
How this maps to the API
| Goal | Where |
|---|---|
| List schemas and aliases | GET /v1/frequencies, GET /v1/frequencies/{frequency_schema_id} |
| Register a custom domain | POST /v1/holographic/domains |
| List domains | GET /v1/holographic/domains |
| Index memories with graph-aware vectors | enable_holographic, frequency_schema_id on POST /v1/memory and PUT /v1/memory/{memory_id} |
| Search with H-COND / filters | holographic_config on POST /v1/memory/search |
| BYOE base embedding | POST /v1/holographic/transform (and /batch) |
| Rerank only | POST /v1/holographic/rerank |
| Inspect extracted metadata | POST /v1/holographic/metadata |
H-COND (holographic conditional scoring) uses phase alignment between query and candidate metadata so ranking respects your schema dimensions, not only cosine distance. See HolographicConfig in the reference for scoring methods, filters, and tuning.
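To build intuition for why this changes ranking, here is a toy score that is emphatically NOT Papr's H-COND algorithm—just an illustration of the principle that weighted per-dimension agreement can demote a candidate that is merely close in embedding space:

```python
# Toy illustration only — not Papr's actual H-COND scoring. It combines a
# base similarity with weighted agreement on schema dimensions, so a
# candidate that aligns on the schema outranks one with higher raw cosine.
def toy_conditional_score(base_sim, query_dims, cand_dims, weights):
    total_w = sum(weights.values())
    aligned_w = sum(
        w for dim, w in weights.items()
        if query_dims.get(dim) == cand_dims.get(dim)
    )
    return base_sim * (aligned_w / total_w)

weights = {"language": 0.8, "primary_operation": 1.0}
query = {"language": "python", "primary_operation": "sorting"}
py_sort = {"language": "python", "primary_operation": "sorting"}      # fully aligned
js_sort = {"language": "javascript", "primary_operation": "sorting"}  # wrong language

# js_sort has the higher raw similarity (0.82) but loses after conditioning:
print(toy_conditional_score(0.80, query, py_sort, weights))
print(toy_conditional_score(0.82, query, js_sort, weights))
```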
Do you need graph-aware embeddings?
Follow this decision tree:
1. Is your baseline search "good enough"?
→ Yes: Skip graph-aware. Standard semantic + agentic graph is sufficient.
→ No: Continue ↓
2. Can you describe the quality gap in domain terms?
Examples: "Returns Python when I need JavaScript," "Mixes bug reports with feature requests," "Can't distinguish claims vs counter-evidence"
→ Yes: You have a domain mismatch → Continue ↓
→ No: Your problem might be query quality or data sparsity, not embeddings
3. Does a built-in schema match your domain?
- cosqa → Code search (snippet ↔ natural language)
- scifact → Scientific claims ↔ evidence
- general → Mixed content domains
→ Yes: Start with built-in → Measure → Tune
→ No: Continue ↓
4. Can you define 4-14 structured dimensions for your domain?
Examples: contract_type, jurisdiction, ticket_priority, user_intent, evidence_strength
→ Yes: Register custom schema with Papr (co-design recommended)
→ No: Graph-aware mode requires a schema; revisit when you can articulate dimensions
When to enable
- You need topical, temporal, or domain-specific agreement—not just "semantically nearby."
- You can commit to a schema (built-in or custom) and measure ranking.
- You're willing to accept extra configuration complexity for quality gains.
When to skip
- Simple RAG or small corpora where baseline search is enough.
- You have not chosen a schema; graph-aware mode is schema-driven.
- You haven't measured a concrete relevance gap—add after baseline metrics, not by default.