
Enrichment & Provenance

Enrichment is the process of adding AI-generated metadata to an ACO — summary, tags, entities, token counts. ACP’s enrichment model has two defining properties:

  1. Every auto-generated field carries provenance — which model generated it, when, and at what confidence.
  2. Human-authored and machine-generated fields are always distinguishable — the presence or absence of a provenance record is the canonical signal.

The provenance object on an ACO records which model generated each auto-generated field. It is a flat object where each key is a field name and each value is a provenance record.

```yaml
provenance:
  summary:
    model: "claude-haiku-4-5"
    version: "20251001"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.91
  tags:
    model: "claude-haiku-4-5"
    version: "20251001"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.88
  key_entities:
    model: "gpt-4o-mini"
    version: "2024-07-18"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.95
```
| Subfield | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier used for generation. |
| version | string | No | Model version or checkpoint. |
| timestamp | string (ISO 8601) | Yes | When the field was generated. |
| confidence | float 0.0–1.0 | No | Model's self-assessed accuracy for the generated value. |
  • A field with a provenance entry is machine-generated.
  • A field without one is human-authored.

This is the canonical distinction. You do not need a separate flag or field to know whether a tag was typed by a human or generated by an AI — you check whether provenance.tags exists.

If a human edits a machine-generated field, the provenance entry SHOULD be removed. The field is now human-authored.
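The canonical distinction and the human-edit rule can be sketched together. This assumes the ACO is held as a plain dict; the helper names (`is_machine_generated`, `apply_human_edit`) are illustrative, not part of ACP:

```python
def is_machine_generated(aco: dict, field: str) -> bool:
    """A field is machine-generated iff it has a provenance entry."""
    return field in aco.get("provenance", {})

def apply_human_edit(aco: dict, field: str, value) -> None:
    """Write a human-authored value and drop the provenance record,
    so the field now reads as human-authored."""
    aco[field] = value
    aco.get("provenance", {}).pop(field, None)

aco = {
    "tags": ["tokenizers"],
    "provenance": {
        "tags": {
            "model": "claude-haiku-4-5",
            "timestamp": "2026-02-23T10:31:00Z",
            "confidence": 0.88,
        }
    },
}
assert is_machine_generated(aco, "tags")
apply_human_edit(aco, "tags", ["tokenizers", "context-window"])
assert not is_machine_generated(aco, "tags")
```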


ACP distinguishes two kinds of confidence scores that coexist on an object but measure fundamentally different things.

ACO-level confidence

```yaml
confidence: 0.82
```

A float from 0.0 to 1.0 representing the behavioral relevance of this object — how reliable it has proven to be as a reference source, computed from engagement signals: saves, shares, comments, recency of interaction, collection membership, and cross-referencing frequency.

This is NOT a model accuracy score. It is a signal about the object’s utility to consumers, computed by the implementation from usage patterns.
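One way such a score could be computed is a weighted sum over normalized engagement signals. The weights and the assumption that each signal is pre-normalized to 0.0–1.0 are purely illustrative; ACP leaves the formula to implementations:

```python
def behavioral_confidence(signals: dict) -> float:
    """Illustrative engagement-based relevance score.
    The weights below are assumptions, not specified by ACP."""
    weights = {
        "saves": 0.30, "shares": 0.25, "comments": 0.15,
        "recency": 0.10, "collections": 0.10, "cross_refs": 0.10,
    }
    # Each signal is assumed already normalized to 0.0-1.0.
    score = sum(weights[k] * signals.get(k, 0.0) for k in weights)
    return round(min(max(score, 0.0), 1.0), 2)

print(behavioral_confidence({
    "saves": 1.0, "shares": 0.8, "comments": 0.6,
    "recency": 0.9, "collections": 0.5, "cross_refs": 0.7,
}))
```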

Provenance confidence

```yaml
provenance:
  summary:
    confidence: 0.91
  tags:
    confidence: 0.88
```

Each provenance record carries the generating model’s self-assessed accuracy for that specific field. This reflects how confident the model was in its output at generation time — not how useful the object has proven to be over time.

| | ACO-level confidence | Provenance confidence |
|---|---|---|
| What it measures | Behavioral relevance and utility | Model accuracy at generation time |
| Who sets it | Implementation (engagement-based) | Generating model |
| When it changes | Over time as usage data accumulates | Only if the field is regenerated |
| Cross-model comparability | Implementation-specific | NOT guaranteed to be comparable across models |

An object can be highly saved and referenced (high behavioral confidence) while having low-confidence auto-generated tags. Or an object can have high-confidence AI-generated fields but low usage over time. These are independent signals.

Guidance: Implementations SHOULD surface enrichments with per-field provenance confidence below 0.7 for human review. Implementations MAY define minimum thresholds below which auto-generated fields are not displayed.
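That review guidance can be sketched as a simple scan over the provenance object, assuming the ACO is a dict (the function name and threshold constant are illustrative):

```python
REVIEW_THRESHOLD = 0.7  # per the SHOULD guidance above

def fields_needing_review(aco: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return auto-generated field names whose provenance confidence is
    below the threshold. Records without a confidence value are skipped,
    since the subfield is optional."""
    flagged = []
    for field, record in aco.get("provenance", {}).items():
        conf = record.get("confidence")
        if conf is not None and conf < threshold:
            flagged.append(field)
    return flagged

aco = {"provenance": {"summary": {"confidence": 0.91},
                      "tags": {"confidence": 0.62},
                      "key_entities": {}}}
print(fields_needing_review(aco))  # prints ['tags']
```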


The four core enrichments ACP is designed to support:

Summary

A concise human-readable description of the content body. Max 500 characters recommended. Useful for agents to preview content before deciding whether to fetch the full body.

```yaml
summary: "Analysis of how tokenizer divergence across models affects context window planning for AI agents."
provenance:
  summary:
    model: "claude-haiku-4-5"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.91
```

Tags

Classification tags for search, filtering, and clustering. Lowercase recommended.

```yaml
tags: ["tokenizers", "context-window", "ai-agents", "llm-infrastructure"]
provenance:
  tags:
    model: "claude-haiku-4-5"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.88
```

Key Entities

Typed named entities extracted from the content body. Enables structured queries: "show me all ACOs mentioning Anthropic with confidence > 0.9."

```yaml
key_entities:
  - type: "organization"
    name: "Anthropic"
    confidence: 0.98
  - type: "technology"
    name: "Claude"
    confidence: 0.97
  - type: "concept"
    name: "tokenization"
    confidence: 0.93
provenance:
  key_entities:
    model: "claude-haiku-4-5"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.95
```

Note: Entity-level confidence values inherit their model identity from provenance.key_entities. Per-entity provenance is not carried individually — the batch record covers all entities.
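The structured query quoted above could look like this over a collection of ACOs held as dicts (the `find_mentions` helper and the `id` field are illustrative, not part of ACP):

```python
def find_mentions(acos, name: str, min_confidence: float):
    """Yield ACOs whose key_entities include `name` with an
    entity-level confidence strictly above min_confidence."""
    for aco in acos:
        for entity in aco.get("key_entities", []):
            if entity["name"] == name and entity["confidence"] > min_confidence:
                yield aco
                break  # one qualifying entity is enough for this ACO

acos = [
    {"id": "a1", "key_entities": [{"type": "organization",
                                   "name": "Anthropic", "confidence": 0.98}]},
    {"id": "a2", "key_entities": [{"type": "organization",
                                   "name": "Anthropic", "confidence": 0.74}]},
]
print([a["id"] for a in find_mentions(acos, "Anthropic", 0.9)])  # prints ['a1']
```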

Token Counts

Per-tokenizer token counts. These are typically computed deterministically (not probabilistically), so provenance records for token_counts are not required — but they are valid if your implementation computes them with a model.

```yaml
token_counts:
  cl100k: 2847
  claude: 2791
  approximate: 2830
```
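The `approximate` entry could be populated with a simple heuristic like the one below. The ~4-characters-per-token ratio is an assumption that holds roughly for English text; exact per-tokenizer counts (`cl100k`, `claude`) require the corresponding tokenizer libraries and are not sketched here:

```python
def approximate_token_count(text: str) -> int:
    """Rough heuristic: assume ~4 characters per token.
    Exact counts per tokenizer need the real tokenizer library."""
    return max(1, round(len(text) / 4))

body = "Analysis of how tokenizer divergence across models affects context window planning."
token_counts = {"approximate": approximate_token_count(body)}
print(token_counts)
```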

Enrichment pipelines SHOULD follow these rules:

  1. Skip if provenance exists. If provenance.summary already exists, do not regenerate the summary unless the caller passes a force flag.
  2. Force to overwrite. A force flag regenerates and overwrites the field and updates the provenance record.
  3. Never overwrite human-authored fields. If a field has no provenance record, it is human-authored. Do not overwrite it automatically — even with force. Require explicit human confirmation.
  4. Recompute on content change. If the content body changes (detectable via content_hash), all auto-generated fields SHOULD be flagged as stale. The provenance record remains until regeneration.
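The four rules above can be sketched for a single field. `maybe_enrich`, the `generate` callable, and the `stale` flag are illustrative names; ACP specifies the rules, not this interface:

```python
def maybe_enrich(aco: dict, field: str, generate, force: bool = False) -> None:
    """Apply the pipeline rules for one field. `generate` is a callable
    returning (value, provenance_record)."""
    provenance = aco.setdefault("provenance", {})
    if field in aco and field not in provenance:
        # Rule 3: no provenance record means human-authored --
        # never overwrite automatically, even with force.
        return
    if field in provenance and not force:
        # Rule 1: skip if a provenance record already exists.
        return
    # Rule 2: generate (or force-regenerate) and update provenance.
    value, record = generate(aco)
    aco[field] = value
    provenance[field] = record

def mark_stale_on_content_change(aco: dict, new_hash: str) -> None:
    """Rule 4: if content_hash changed, flag auto-generated fields as
    stale; provenance records remain until regeneration. The `stale`
    flag is an assumed implementation-side marker, not an ACP field."""
    if aco.get("content_hash") != new_hash:
        aco["content_hash"] = new_hash
        for record in aco.get("provenance", {}).values():
            record["stale"] = True

# Human-authored summary survives even a forced run (Rule 3):
aco = {"summary": "hand-written summary"}
maybe_enrich(aco, "summary", lambda a: ("generated", {"model": "m"}), force=True)
assert aco["summary"] == "hand-written summary"
```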

From the ACP research synthesis (non-normative):

  • Enrichment cost: approximately $0.002 per article using GPT-4o-mini or Claude Haiku
  • Latency: 0.8–2.2 seconds per ACO
  • YAML frontmatter is approximately 18% more efficient than JSON for metadata

These figures are reference points. Actual costs and latency depend on content length, model selection, and implementation.