Are We Drifting? — Part 2: The Tagged Vocabulary

Part 1: The Drift Problem ended on a shape: one source → many projections. This part is about the source: specifically, the smallest, most boring unit that the entire pattern is built from.

Start with the most innocent thing in any codebase: a closed set of named values.

A video is uploading, processing, ready, failed, or archived. A ticket is proposed, accepted, in_progress, or shipped. An upload was accepted or rejected.

These look harmless. They are strings. Everyone “knows” them.

That is exactly why they drift.

The backend writes "ready". The frontend dropdown lists "Ready". The analytics event sends "READY". A migration’s CHECK constraint allows 'ready' but a later refactor adds 'done' on the API side and nobody updates the constraint. Three months later a row exists that the frontend cannot render, because the value in the database is a status the UI has never heard of.

No single change was wrong. The set was simply written down in six places, and the six copies wandered.

Here is the move that makes the rest of the series possible.

A closed set of named values is not a string convention. It is a tagged vocabulary: a first-class, declarable thing with a precise shape.

Every entry in a tagged vocabulary has three layers.

Layer 1: Identity
    name           the canonical identifier (how code refers to it)
    wire           the serialized form (how it crosses a boundary)
    deprecated / deprecation_note

Layer 2: Presentation
    label          the human-facing string
    description    optional documentation

Layer 3: Kind-specific metadata
    whatever this kind of vocabulary additionally needs
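To make the three layers concrete, here is a minimal sketch of one entry as a Python dataclass. The class name `Entry` and the choice of defaults are assumptions for illustration; the real source is the manifest, not this type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Entry:
    """One entry of a tagged vocabulary (illustrative model, not generated code)."""

    # Layer 1: identity
    name: str  # how code refers to the value
    wire: str  # how the value crosses a boundary
    # Layer 2: presentation
    label: str
    description: Optional[str] = None
    # Layer 1 again: deprecation fields, defaulting to "not deprecated"
    deprecated: bool = False
    deprecation_note: Optional[str] = None
    # Layer 3: kind-specific metadata, left open because each kind defines its own keys
    metadata: Dict[str, Any] = field(default_factory=dict)


# The Failed entry of VideoStatus, expressed with this model.
failed = Entry(name="Failed", wire="failed", label="Failed",
               metadata={"terminal": True})
```

Only `name`, `wire`, and `label` are required in this sketch, which mirrors the fact that most entries are pure identity and presentation.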

The first layer is the load-bearing one, and the split inside it is the whole trick.

Why name and wire are different on purpose

Code refers to a value by its name. The boundary (JSON, a database column, an event payload, a model’s structured output) sees its wire.

Keeping them separate means the serialized representation is a deliberate, declared decision, not an accident of how someone happened to spell a constant.
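A small sketch of the split, using Python's standard `enum` module: the member name plays the role of `name`, the member value plays the role of `wire`. The helper names `to_wire` and `from_wire` are hypothetical, and the class is written by hand here only to show the idea.

```python
import json
from enum import Enum


class VideoStatus(Enum):
    # Member name = canonical identifier; member value = declared wire form.
    Uploading = "uploading"
    Processing = "processing"
    Ready = "ready"
    Failed = "failed"
    Archived = "archived"


def to_wire(status: VideoStatus) -> str:
    """Serialize: code hands over a name, the boundary receives the wire."""
    return status.value


def from_wire(raw: str) -> VideoStatus:
    """Deserialize: raises ValueError for anything outside the closed set."""
    return VideoStatus(raw)


payload = json.dumps({"status": to_wire(VideoStatus.Ready)})
decoded = from_wire(json.loads(payload)["status"])
```

Code never spells `"ready"` itself; it says `VideoStatus.Ready`, and a stray `"READY"` arriving at the boundary fails loudly instead of being stored.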

This is a real vocabulary from our backend — the lifecycle of an uploaded video:

kind: enum
output: "../enums"
module: media
name: VideoStatus
entries:
  - name: Uploading
    wire: uploading
    label: Uploading
  - name: Processing
    wire: processing
    label: Processing
  - name: Ready
    wire: ready
    label: Ready
  - name: Failed
    wire: failed
    label: Failed
    metadata:
      terminal: true
  - name: Archived
    wire: archived
    label: Archived

Failed carries one piece of kind-specific metadata — terminal: true — because some consumer cares that it is an end state. Everything else is pure identity and presentation.

That file is the source. It is the only place VideoStatus is defined. Every Go constant, every TypeScript union, every SQL CHECK constraint, every dropdown is a projection of it. We will watch those projections get generated in Part 4: Enums as Shared Vocabulary.
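As a rough sketch of what "projection" means, the functions below render three views from the same list of entries. This is not buildmere's output format and the function names are invented for illustration; the manifest is inlined as Python data to keep the example self-contained.

```python
from typing import Dict, List

# The VideoStatus entries, copied from the manifest (metadata omitted).
ENTRIES: List[Dict[str, str]] = [
    {"name": "Uploading", "wire": "uploading", "label": "Uploading"},
    {"name": "Processing", "wire": "processing", "label": "Processing"},
    {"name": "Ready", "wire": "ready", "label": "Ready"},
    {"name": "Failed", "wire": "failed", "label": "Failed"},
    {"name": "Archived", "wire": "archived", "label": "Archived"},
]


def ts_union(type_name: str, entries: List[Dict[str, str]]) -> str:
    """Project the wire values into a TypeScript string-literal union."""
    members = " | ".join('"' + e["wire"] + '"' for e in entries)
    return "export type " + type_name + " = " + members + ";"


def sql_check(column: str, entries: List[Dict[str, str]]) -> str:
    """Project the wire values into a SQL CHECK constraint."""
    values = ", ".join("'" + e["wire"] + "'" for e in entries)
    return "CHECK (" + column + " IN (" + values + "))"


def dropdown_options(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Project wire plus label into options for a UI dropdown."""
    return [{"value": e["wire"], "label": e["label"]} for e in entries]
```

Because all three functions read the same list, adding an entry changes every projection at once, which is the property the hand-maintained copies lacked.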

The reason this is worth elevating to a “primitive” is that it is not just enums.

Look at three vocabularies from the same backend, side by side. They are obviously the same shape.

An enum — identity, label, a flag:

- name: Failed
  wire: failed
  label: Failed
  metadata:
    terminal: true

An error — identity, label, plus the metadata a failure needs (a transport code, and the typed fields it carries):

- name: UploadTooLarge
  wire: upload_too_large
  label: Upload exceeds size limit
  metadata:
    code: invalid_argument
    fields: [asset_id, size_bytes, max_bytes]

A metric — identity, label, plus the metadata an instrument needs (its kind and its labels):

- name: UploadIntentCreated
  wire: media.upload.intent_created
  label: Upload intents created
  metadata:
    instrument: counter
    labels:
      - name: outcome
        values: [accepted, rejected]

Three different concerns — a state machine, a failure mode, an observability signal. One structure: a closed set of entries, each with a name, a wire, a label, and a bag of kind-specific metadata.

An enum is the base vocabulary. An error is that base plus failure metadata. A metric is that base plus instrument metadata. Strip the extra metadata from either and you get the enum back.
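One way to see the shared shape is a single validator that treats every kind as the same base plus a per-kind list of metadata keys. The sketch below assumes which keys each kind requires, inferred only from the three examples in this post; a real kernel may enforce something different.

```python
from typing import Any, Dict, List, Set

# Every entry of every kind carries these.
BASE_KEYS: Set[str] = {"name", "wire", "label"}

# Assumption: required metadata per kind, guessed from the examples above.
REQUIRED_METADATA: Dict[str, Set[str]] = {
    "enum": set(),
    "error": {"code", "fields"},
    "metric": {"instrument", "labels"},
}


def validate(kind: str, entry: Dict[str, Any]) -> List[str]:
    """Return a sorted list of problems; an empty list means the entry is well-formed."""
    problems = ["missing " + key for key in sorted(BASE_KEYS - set(entry))]
    metadata = entry.get("metadata", {})
    for key in sorted(REQUIRED_METADATA[kind] - set(metadata)):
        problems.append("missing metadata." + key)
    return problems


failed = {"name": "Failed", "wire": "failed", "label": "Failed",
          "metadata": {"terminal": True}}
upload_too_large = {
    "name": "UploadTooLarge", "wire": "upload_too_large",
    "label": "Upload exceeds size limit",
    "metadata": {"code": "invalid_argument",
                 "fields": ["asset_id", "size_bytes", "max_bytes"]},
}
upload_intent_created = {
    "name": "UploadIntentCreated", "wire": "media.upload.intent_created",
    "label": "Upload intents created",
    "metadata": {"instrument": "counter",
                 "labels": [{"name": "outcome", "values": ["accepted", "rejected"]}]},
}
```

The only thing that varies between kinds is one row in `REQUIRED_METADATA`. An error entry with its metadata removed still passes as an enum entry, which is the sense in which the enum is the base.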

Once you see it, you cannot unsee it. Permissions are a vocabulary (identity plus an action and a resource). Config keys are a vocabulary (identity plus a type and a default). Event types, job kinds, plan tiers, notification channels — every closed named set that crosses a boundary is the same atom.

The unification is not academic tidiness. It collapses N mental models into one — and that is worth the most when the contributor is an agent.

Without it, every closed set is a bespoke situation: enums are declared one way, errors another, config a third, each with its own conventions and its own places to look. A contributor — human or model — has to learn each one, and an agent generating code has N chances to invent a value inline because it did not know the convention for that particular kind.

With it, there is exactly one rule, and it fits in a sentence:

If it is a closed, named set of values that crosses a boundary, it is a vocabulary. Vocabularies live in manifests. Code uses the generated constants. Inventing a value inline is a lint failure, not a style nit.
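The lint half of that rule can be sketched with Python's standard `ast` module. This is a deliberately narrow illustration, not the actual lint: it assumes the vocabulary value lives on an attribute called `status`, and it flags any comparison of that attribute against a string literal.

```python
import ast
from typing import List, Tuple


def find_inline_statuses(source: str) -> List[Tuple[int, str]]:
    """Return (line, literal) for every `.status` comparison against a string literal."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left] + list(node.comparators)
        touches_status = any(
            isinstance(op, ast.Attribute) and op.attr == "status" for op in operands
        )
        if not touches_status:
            continue
        for op in operands:
            # A string literal here means a value was spelled inline.
            if isinstance(op, ast.Constant) and isinstance(op.value, str):
                findings.append((op.lineno, op.value))
    return sorted(findings)


SOURCE = '''
if video.status == "done":
    publish(video)
if video.status == "ready":
    publish(video)
if video.status == VideoStatus.Ready:
    publish(video)
'''

violations = find_inline_statuses(SOURCE)
```

Note that `"ready"` is flagged even though it is a valid wire value. The rule is about where the value comes from, not only whether it happens to be spelled correctly today.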

Every repeated question now has a single answer. What are the allowed values for X? — the manifest for X. How do I add one? — add an entry, regenerate, commit. How do I deprecate one? — mark it deprecated in the manifest. Where is this value used? — follow the projections from the manifest.
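Deprecation is a good example of a question the manifest can answer mechanically. The sketch below assumes one possible policy: deprecated entries still decode (old rows exist) but emit a warning, and they are hidden from anything a user can pick. The `Transcoding` entry is hypothetical and exists only to have something to deprecate.

```python
import warnings
from typing import Any, Dict, List

ENTRIES: List[Dict[str, Any]] = [
    {"name": "Processing", "wire": "processing", "label": "Processing"},
    # Hypothetical entry, not part of the real VideoStatus manifest.
    {"name": "Transcoding", "wire": "transcoding", "label": "Transcoding",
     "deprecated": True, "deprecation_note": "use Processing"},
]

BY_WIRE = {entry["wire"]: entry for entry in ENTRIES}


def selectable(entries: List[Dict[str, Any]]) -> List[str]:
    """Wire values that new writes may use: everything not marked deprecated."""
    return [e["wire"] for e in entries if not e.get("deprecated", False)]


def decode(raw: str) -> str:
    """Map a wire value to its name; KeyError if it is outside the closed set."""
    entry = BY_WIRE[raw]
    if entry.get("deprecated", False):
        note = entry.get("deprecation_note", "")
        warnings.warn(entry["name"] + " is deprecated: " + note,
                      DeprecationWarning, stacklevel=2)
    return entry["name"]
```

The flag lives in one place, and both behaviors (hide from dropdowns, warn on read) are derived from it rather than remembered by each consumer.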

Not shared vocabulary — a canonical language

“Shared vocabulary” undersells what this is. A vocabulary declared once is the canonical language of the whole system, and it is spoken by three audiences at the same time: the systems that serialize the value across a wire, the humans who name it and argue about it in review, and the agents that write against it.

The third audience is the one that changes the stakes. An agent does not only inherit the vocabulary by compiling against generated constants — it can discover it. The same closed sets are reachable at runtime: an agent asks for the live set of backlog owners or statuses over MCP rather than hard-coding them, and the operations a service exposes are themselves introspectable, so the agent reads the contract instead of guessing it. The vocabulary is not just baked into the binary; it is a queryable, shared language.
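To illustrate the shape of runtime discovery, here is a sketch of a tiny in-process registry that a tool endpoint could sit in front of. It does not use any MCP SDK and makes no claim about how our services expose this; the registry key `media.VideoStatus` and the function names are assumptions.

```python
import json
from typing import Any, Dict, List

# Hypothetical registry, populated from the same manifests that feed codegen.
VOCABULARIES: Dict[str, List[Dict[str, str]]] = {
    "media.VideoStatus": [
        {"name": "Uploading", "wire": "uploading", "label": "Uploading"},
        {"name": "Processing", "wire": "processing", "label": "Processing"},
        {"name": "Ready", "wire": "ready", "label": "Ready"},
        {"name": "Failed", "wire": "failed", "label": "Failed"},
        {"name": "Archived", "wire": "archived", "label": "Archived"},
    ],
}


def list_vocabularies() -> List[str]:
    """What closed sets exist? An agent can ask instead of guessing."""
    return sorted(VOCABULARIES)


def describe(name: str) -> Dict[str, Any]:
    """Return one vocabulary in a JSON-serializable form."""
    if name not in VOCABULARIES:
        raise KeyError("unknown vocabulary: " + name)
    return {"vocabulary": name, "entries": VOCABULARIES[name]}


def allowed_wires(name: str) -> List[str]:
    """The live set of wire values, in manifest order."""
    return [entry["wire"] for entry in describe(name)["entries"]]


# What a caller would receive over whatever transport carries the answer.
response = json.dumps(describe("media.VideoStatus"))
```

The point of the sketch is only that the answer comes from the same declaration the generated constants come from, so the compiled view and the queried view cannot disagree.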

A vocabulary that humans, systems, and agents all draw from, and that agents can discover at runtime, is not a naming convention. It is the canonical language the whole codebase speaks.

That is why the atom matters out of proportion to its size. Get the closed-set-declared-once right, and you are not deduplicating strings — you are giving every participant, silicon or human, the same words.

The coherence check this removes is the one nobody schedules and everybody pays: did the value I just used actually exist, spelled exactly that way, on every side that has to agree about it?

When that check is a lint rule reading from a manifest, the answer is structural. An agent can add a status, regenerate, and ship — and the moment it references a value that is not in the vocabulary, the build says so, before review, before production.

The expensive part was never typing the enum. It was the standing tax of trusting that the six copies still agreed. Naming the atom is what lets a machine pay that tax for you.

We have the atom. Part 3: buildmere, a Codegen Kernel introduces the machine that turns it into code on every side: buildmere, a small codegen kernel where each kind of vocabulary is a plugin — and where the drift gate is one make target.