
Enrichment & Provenance

Enrichment is the process of adding AI-generated metadata to an ACO — summary, tags, entities, token counts. ACP’s enrichment model has two defining properties:

  1. Every auto-generated field carries provenance — which model generated it, when, and at what confidence.
  2. Human-authored and machine-generated fields are always distinguishable — the presence or absence of a provenance record is the canonical signal.

The provenance object on an ACO records which model generated each auto-generated field. It is a flat object where each key is a field name and each value is a provenance record.

```yaml
provenance:
  summary:
    model: "claude-haiku-4-5"
    version: "20251001"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.91
  tags:
    model: "claude-haiku-4-5"
    version: "20251001"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.88
  key_entities:
    model: "gpt-4o-mini"
    version: "2024-07-18"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.95
```
| Subfield | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier used for generation. |
| version | string | No | Model version or checkpoint. |
| timestamp | string (ISO 8601) | Yes | When the field was generated. |
| confidence | float 0.0–1.0 | No | Model's self-assessed accuracy for the generated value. |
  • A field with a provenance entry is machine-generated.
  • A field without one is human-authored.

This is the canonical distinction. You do not need a separate flag or field to know whether a tag was typed by a human or generated by an AI — you check whether provenance.tags exists.

If a human edits a machine-generated field, the provenance entry SHOULD be removed. The field is now human-authored.
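The canonical distinction and the human-edit rule can be sketched together. This assumes the ACO is held as a plain dict; the helper names (`is_machine_generated`, `apply_human_edit`) are illustrative, not part of ACP:

```python
def is_machine_generated(aco: dict, field: str) -> bool:
    """A field is machine-generated iff it has a provenance entry."""
    return field in aco.get("provenance", {})

def apply_human_edit(aco: dict, field: str, value) -> None:
    """Write a human-authored value and drop the provenance record,
    so the field now reads as human-authored."""
    aco[field] = value
    aco.get("provenance", {}).pop(field, None)

aco = {
    "tags": ["tokenizers"],
    "provenance": {
        "tags": {
            "model": "claude-haiku-4-5",
            "timestamp": "2026-02-23T10:31:00Z",
            "confidence": 0.88,
        }
    },
}
assert is_machine_generated(aco, "tags")
apply_human_edit(aco, "tags", ["tokenizers", "context-window"])
assert not is_machine_generated(aco, "tags")
```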


ACP distinguishes two kinds of confidence scores that coexist on an object but measure fundamentally different things.

ACO-level confidence

```yaml
confidence: 0.82
```

A float from 0.0 to 1.0 representing the behavioral relevance of this object — how reliable it has proven to be as a reference source, computed from engagement signals: saves, shares, comments, recency of interaction, collection membership, and cross-referencing frequency.

This is NOT a model accuracy score. It is a signal about the object’s utility to consumers, computed by the implementation from usage patterns.
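One way such a score could be computed is a weighted sum over normalized engagement signals. The weights and the assumption that each signal is pre-normalized to 0.0–1.0 are purely illustrative; ACP leaves the formula to implementations:

```python
def behavioral_confidence(signals: dict) -> float:
    """Illustrative engagement-based relevance score.
    The weights below are assumptions, not specified by ACP."""
    weights = {
        "saves": 0.30, "shares": 0.25, "comments": 0.15,
        "recency": 0.10, "collections": 0.10, "cross_refs": 0.10,
    }
    # Each signal is assumed already normalized to 0.0-1.0.
    score = sum(weights[k] * signals.get(k, 0.0) for k in weights)
    return round(min(max(score, 0.0), 1.0), 2)

print(behavioral_confidence({
    "saves": 1.0, "shares": 0.8, "comments": 0.6,
    "recency": 0.9, "collections": 0.5, "cross_refs": 0.7,
}))
```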

Provenance confidence

```yaml
provenance:
  summary:
    confidence: 0.91
  tags:
    confidence: 0.88
```

Each provenance record carries the generating model’s self-assessed accuracy for that specific field. This reflects how confident the model was in its output at generation time — not how useful the object has proven to be over time.

| | ACO-level confidence | Provenance confidence |
|---|---|---|
| What it measures | Behavioral relevance and utility | Model accuracy at generation time |
| Who sets it | Implementation (engagement-based) | Generating model |
| When it changes | Over time as usage data accumulates | Only if the field is regenerated |
| Cross-model comparability | Implementation-specific | NOT guaranteed to be comparable across models |

An object can be highly saved and referenced (high behavioral confidence) while having low-confidence auto-generated tags. Or an object can have high-confidence AI-generated fields but low usage over time. These are independent signals.

Guidance: Implementations SHOULD surface enrichments with per-field provenance confidence below 0.7 for human review. Implementations MAY define minimum thresholds below which auto-generated fields are not displayed.
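That review guidance can be sketched as a simple scan over the provenance object, assuming the ACO is a dict (the function name and threshold constant are illustrative):

```python
REVIEW_THRESHOLD = 0.7  # per the SHOULD guidance above

def fields_needing_review(aco: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return auto-generated field names whose provenance confidence is
    below the threshold. Records without a confidence value are skipped,
    since the subfield is optional."""
    flagged = []
    for field, record in aco.get("provenance", {}).items():
        conf = record.get("confidence")
        if conf is not None and conf < threshold:
            flagged.append(field)
    return flagged

aco = {"provenance": {"summary": {"confidence": 0.91},
                      "tags": {"confidence": 0.62},
                      "key_entities": {}}}
print(fields_needing_review(aco))  # prints ['tags']
```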


The four core enrichments ACP is designed to support:

Summary

A concise human-readable description of the content body. Max 500 characters recommended. Useful for agents to preview content before deciding whether to fetch the full body.

```yaml
summary: "Analysis of how tokenizer divergence across models affects context window planning for AI agents."
provenance:
  summary:
    model: "claude-haiku-4-5"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.91
```

Tags

Classification tags for search, filtering, and clustering. Lowercase recommended.

```yaml
tags: ["tokenizers", "context-window", "ai-agents", "llm-infrastructure"]
provenance:
  tags:
    model: "claude-haiku-4-5"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.88
```

Key Entities

Typed named entities extracted from the content body. Enables structured queries: "show me all ACOs mentioning Anthropic with confidence > 0.9."

```yaml
key_entities:
  - type: "organization"
    name: "Anthropic"
    confidence: 0.98
  - type: "technology"
    name: "Claude"
    confidence: 0.97
  - type: "concept"
    name: "tokenization"
    confidence: 0.93
provenance:
  key_entities:
    model: "claude-haiku-4-5"
    timestamp: "2026-02-23T10:31:00Z"
    confidence: 0.95
```

Note: Entity-level confidence values inherit their model identity from provenance.key_entities. Per-entity provenance is not carried individually — the batch record covers all entities.
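The structured query quoted above could look like this over a collection of ACOs held as dicts (the `find_mentions` helper and the `id` field are illustrative, not part of ACP):

```python
def find_mentions(acos, name: str, min_confidence: float):
    """Yield ACOs whose key_entities include `name` with an
    entity-level confidence strictly above min_confidence."""
    for aco in acos:
        for entity in aco.get("key_entities", []):
            if entity["name"] == name and entity["confidence"] > min_confidence:
                yield aco
                break  # one qualifying entity is enough for this ACO

acos = [
    {"id": "a1", "key_entities": [{"type": "organization",
                                   "name": "Anthropic", "confidence": 0.98}]},
    {"id": "a2", "key_entities": [{"type": "organization",
                                   "name": "Anthropic", "confidence": 0.74}]},
]
print([a["id"] for a in find_mentions(acos, "Anthropic", 0.9)])  # prints ['a1']
```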

Token Counts

Per-tokenizer token counts. These are typically computed deterministically (not probabilistically), so provenance records for token_counts are not required — but they are valid if your implementation computes them with a model.

```yaml
token_counts:
  cl100k: 2847
  claude: 2791
  approximate: 2830
```
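The `approximate` entry could be populated with a simple heuristic like the one below. The ~4-characters-per-token ratio is an assumption that holds roughly for English text; exact per-tokenizer counts (`cl100k`, `claude`) require the corresponding tokenizer libraries and are not sketched here:

```python
def approximate_token_count(text: str) -> int:
    """Rough heuristic: assume ~4 characters per token.
    Exact counts per tokenizer need the real tokenizer library."""
    return max(1, round(len(text) / 4))

body = "Analysis of how tokenizer divergence across models affects context window planning."
token_counts = {"approximate": approximate_token_count(body)}
print(token_counts)
```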

Enrichment pipelines SHOULD follow these rules:

  1. Skip if provenance exists. If provenance.summary already exists, do not regenerate the summary unless the caller passes a force flag.
  2. Force to overwrite. A force flag regenerates and overwrites the field and updates the provenance record.
  3. Never overwrite human-authored fields. If a field has no provenance record, it is human-authored. Do not overwrite it automatically — even with force. Require explicit human confirmation.
  4. Recompute on content change. If the content body changes (detectable via content_hash), all auto-generated fields SHOULD be flagged as stale. The provenance record remains until regeneration.
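The four rules above can be sketched for a single field. `maybe_enrich`, the `generate` callable, and the `stale` flag are illustrative names; ACP specifies the rules, not this interface:

```python
def maybe_enrich(aco: dict, field: str, generate, force: bool = False) -> None:
    """Apply the pipeline rules for one field. `generate` is a callable
    returning (value, provenance_record)."""
    provenance = aco.setdefault("provenance", {})
    if field in aco and field not in provenance:
        # Rule 3: no provenance record means human-authored --
        # never overwrite automatically, even with force.
        return
    if field in provenance and not force:
        # Rule 1: skip if a provenance record already exists.
        return
    # Rule 2: generate (or force-regenerate) and update provenance.
    value, record = generate(aco)
    aco[field] = value
    provenance[field] = record

def mark_stale_on_content_change(aco: dict, new_hash: str) -> None:
    """Rule 4: if content_hash changed, flag auto-generated fields as
    stale; provenance records remain until regeneration. The `stale`
    flag is an assumed implementation-side marker, not an ACP field."""
    if aco.get("content_hash") != new_hash:
        aco["content_hash"] = new_hash
        for record in aco.get("provenance", {}).values():
            record["stale"] = True

# Human-authored summary survives even a forced run (Rule 3):
aco = {"summary": "hand-written summary"}
maybe_enrich(aco, "summary", lambda a: ("generated", {"model": "m"}), force=True)
assert aco["summary"] == "hand-written summary"
```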

From the ACP research synthesis (non-normative):

  • Enrichment cost: approximately $0.002 per article using GPT-4o-mini or Claude Haiku
  • Latency: 0.8–2.2 seconds per ACO
  • YAML frontmatter is approximately 18% more efficient than JSON for metadata

These figures are reference points. Actual costs and latency depend on content length, model selection, and implementation.