Are We Drifting? — Part 6: Config and Metrics

The vocabularies in Part 4: Enums as Shared Vocabulary and Part 5: The Error Manifest described a service’s data. This part covers its edges — the environment it trusts on the way in, and the signals it emits on the way out. Both drift. Both are the same atom.

Config drift is the classic one. An environment variable is read with os.Getenv in three different files, with three slightly different defaults, and one of them forgets the variable is required. The .env.example that is supposed to document what the service needs was last updated two features ago. The service boots fine in dev and falls over in staging because a variable nobody remembered is missing — and the failure is a nil deref deep in a request, not a clear “you forgot DATABASE_URL.”

Metrics drift the same way. A counter is incremented with a string literal "media.upload.intent_created", and the dashboard queries media.uploads.intent_created: one letter apart, silently graphing nothing. A label gets a new value that no one added to the alert.

Both are closed sets that cross a boundary — the process boundary for config, the telemetry boundary for metrics. Both want to be vocabularies.

The same three-tier wall that governs the enums and the error manifest governs these edges too: a lint-error for the things that happen daily, a compile assertion for the seam that must never rot, and a CI coverage script for the inventory that drifts on its own schedule.

A config entry is the familiar atom: a name (how Go refers to it), a wire (the environment variable), a label, and metadata describing its type, default, and validation. Here are a few real entries from our API’s config manifest:

kind: config
source: env
module: api
name: Config
entries:
  - name: Port
    wire: PORT
    label: HTTP listen port
    metadata:
      type: string
      default: "8080"
      group: server
  - name: DatabaseURL
    wire: DATABASE_URL
    label: PostgreSQL connection URL
    metadata:
      type: string
      group: database
      validation:
        required: true
  - name: S3SecretAccessKey
    wire: S3_SECRET_ACCESS_KEY
    label: S3 secret access key
    metadata:
      type: string
      secret: true
      group: s3

The interesting part is that validation is itself declarative. The manifest does not just list variables; it states the rules between them, in a small vocabulary of its own:

  - name: SupabaseJWKSURL
    wire: SUPABASE_JWKS_URL
    metadata:
      validation:
        one_of_group: supabase_jwt        # exactly one of this group must be set
        required_when:
          field: AppEnv
          values: [staging, production]   # …but only in these environments
  - name: S3PublicEndpoint
    wire: S3_PUBLIC_ENDPOINT
    metadata:
      validation:
        must_match_or_empty_when:         # must equal S3Endpoint, or be empty,
          field: S3Endpoint               # when running in staging/production
          when:
            field: AppEnv
            values: [staging, production]

required, required_when, one_of_group, must_match_or_empty_when — the relationships that usually live in a paragraph of onboarding docs (or in nobody’s head) are declared facts. From this one manifest, buildmere generates:

  • a typed Go Config struct, so config access is cfg.DatabaseURL, never a stray os.Getenv;
  • a Load() that reads the environment and a Validate() that enforces every rule above;
  • a .env example file, so the documented environment cannot drift from the required one;
  • an env-check command that validates a real .env against the manifest.

That last pair is the anti-drift hinge. The example and the validator come from the same source as the struct, so “what the service needs,” “what the example shows,” and “what the startup check enforces” are guaranteed to be the same list. A missing required variable fails at boot with a clear message — or fails env-check in CI before it ever boots.
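To make the Load/Validate pair concrete, here is a minimal sketch of what generated code of this shape could look like. It is not buildmere's actual output: the function signatures, the getenv parameter, the error wording, and the wire names APP_ENV and S3_ENDPOINT are assumptions for illustration. Only PORT, DATABASE_URL, S3_PUBLIC_ENDPOINT, and the rules themselves come from the manifest excerpts above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Config mirrors a few manifest entries. Field names follow the manifest;
// the rest of this file is a hypothetical sketch, not generated output.
type Config struct {
	AppEnv           string // wire APP_ENV is assumed
	Port             string
	DatabaseURL      string
	S3Endpoint       string // wire S3_ENDPOINT is assumed
	S3PublicEndpoint string
}

// Load reads each wire through getenv and applies declared defaults.
// Taking getenv as a parameter keeps the sketch testable without touching
// the real process environment.
func Load(getenv func(string) string) Config {
	cfg := Config{
		AppEnv:           getenv("APP_ENV"),
		Port:             getenv("PORT"),
		DatabaseURL:      getenv("DATABASE_URL"),
		S3Endpoint:       getenv("S3_ENDPOINT"),
		S3PublicEndpoint: getenv("S3_PUBLIC_ENDPOINT"),
	}
	if cfg.Port == "" {
		cfg.Port = "8080" // default declared in the manifest
	}
	return cfg
}

// Validate enforces `required` and `must_match_or_empty_when`, collecting
// every violation so a single boot reports all of them at once.
func (c Config) Validate() error {
	var errs []error
	if c.DatabaseURL == "" {
		errs = append(errs, errors.New("DATABASE_URL is required"))
	}
	prodLike := c.AppEnv == "staging" || c.AppEnv == "production"
	if prodLike && c.S3PublicEndpoint != "" && c.S3PublicEndpoint != c.S3Endpoint {
		errs = append(errs, fmt.Errorf(
			"S3_PUBLIC_ENDPOINT must match S3_ENDPOINT or be empty when APP_ENV=%s", c.AppEnv))
	}
	return errors.Join(errs...) // nil when no rule was violated
}

func main() {
	cfg := Load(os.Getenv)
	if err := cfg.Validate(); err != nil {
		fmt.Fprintln(os.Stderr, "config invalid:")
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("listening on port", cfg.Port)
}
```

Collecting violations with errors.Join rather than returning on the first one is a design choice in this sketch: a fresh environment usually misses several variables at once, and one complete report is cheaper than a fix-and-reboot loop.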

The CI tier makes the same check explicit before deploy: env-check validates a real .env against the requirements the config manifest declares and fails the build when the two lists disagree, whether that is a variable the service stopped reading but the example still documents, or a newly required variable nobody has added yet.

A metric entry is the same atom with instrument metadata. Here is the real media-metrics manifest:

kind: metric
output: ".."
module: media
name: MediaMetrics
entries:
  - name: UploadIntentCreated
    wire: media.upload.intent_created
    label: Upload intents created
    metadata:
      instrument: counter
      labels:
        - name: content_type
        - name: outcome
          values: [accepted, rejected]
  - name: TranscodingDuration
    wire: media.transcoding.duration
    label: Transcoding job duration
    metadata:
      instrument: histogram
      unit: s
      labels:
        - name: status
          values: [success, failure]
  - name: ActiveUploads
    wire: media.uploads.active
    label: In-progress uploads
    metadata:
      instrument: updown_counter

Notice the nested vocabulary: a label’s values: [accepted, rejected] is itself a closed set, declared inline. The metric name (media.upload.intent_created) is the wire; it is written exactly once, here.

The generated Go is, again, zero-import in the sense that matters: it imports nothing outside the standard library, depending only on context and a Factory interface the project implements. This excerpt is verbatim:

// Code generated by buildmere; DO NOT EDIT.

package media

import "context"

// Factory builds instruments by name. Implementations live in the consuming
// project's metrics kit; the generated code never imports it.
type Factory interface {
	Counter(name, desc string) interface {
		Add(ctx context.Context, n int64, labels ...any)
	}
	Histogram(name, desc, unit string) interface {
		Record(ctx context.Context, v float64, labels ...any)
	}
	UpDownCounter(name, desc string) interface {
		Add(ctx context.Context, n int64, labels ...any)
	}
}

var MediaMetrics = &mediaMetrics{}

// Register builds every instrument from f. Call once after the metrics
// backend is initialized.
func (m *mediaMetrics) Register(f Factory) {
	m.uploadIntentCreated = f.Counter("media.upload.intent_created", "Upload intents created")
	m.transcodingDuration = f.Histogram("media.transcoding.duration", "Transcoding job duration", "s")
	m.activeUploads = f.UpDownCounter("media.uploads.active", "In-progress uploads")
}

// RecordUploadIntentCreated records Upload intents created.
// Labels: outcome (accepted | rejected)
func (m *mediaMetrics) RecordUploadIntentCreated(ctx context.Context, contentType string, outcome string) {
	if m.uploadIntentCreated == nil {
		return
	}
	m.uploadIntentCreated.Add(ctx, 1, "content_type", contentType, "outcome", outcome)
}

Two things matter here.

First, the call site is typed: RecordUploadIntentCreated(ctx, contentType, outcome). You cannot fat-finger the metric name, because it is baked into the generated method, and you cannot forget a label, because the labels are parameters. No hand-written Go contains the string "media.upload.intent_created"; it is written once, in the manifest, and every other occurrence is generated from it.

Second, the generated code imports no OpenTelemetry. It defines a Factory interface and depends on that. The project’s own metrics kit implements the interface (metricskit.BuildmereFactory) — the ~60-line adapter from Part 3: buildmere, a Codegen Kernel — and buildmere ships a compile-time assertion that the adapter satisfies the generated Factory. The instrument set is portable; the binding to OTel is the project’s, and the compiler checks the seam.

That seam is held by a zero-cost Go idiom: var _ Factory = (*metricskit.BuildmereFactory)(nil). That line compiles to nothing and refuses to compile at all if the adapter ever falls out of sync with the generated interface — no test to write, no CI script to wire, just a fact the compiler checks on every build.

The manifests govern the names and shapes that cross the boundary. The remaining gate governs the values — the rule that nothing structured gets flattened into a raw string on the way out. That is what sloglint in apps/api/.golangci.yml enforces, across three properties, each added because a specific class of log corruption had already happened or was structurally guaranteed the moment an agent writes a handler without knowing the rule.

static-msg fires when a developer writes slog.Error("query failed for user " + userID) — the interpolation swallows the structure and makes the field invisible to a log query.

no-mixed-args fires when the call mixes positional and key-value args: slog.Error("query failed", err, "athlete_id", id) looks reasonable at a glance, but the error is a bare positional arg with no key (at runtime slog would file it under the placeholder key !BADKEY), and sloglint rejects it.

context: scope is the one with the longest tail — it requires slog.ErrorContext (the *Context variant) any time a ctx context.Context is already in scope, because trace IDs travel through context and a plain slog.Error in a handler throws that thread away.

Three lint rules. Three specific call-site shapes. Each one fires before the PR lands.

The natural extension is an otelslog bridge that feeds those structured records straight into spans when an OTLP endpoint is configured. Today the service logs through slog alone; the bridge is the next step.

| Drift scenario | Named gate | Tier |
| --- | --- | --- |
| .env.example drifts from what the manifest declares required | env-check against the manifest | CI · coverage-script |
| New metricskit adapter does not satisfy the generated Factory interface | var _ Factory = (*metricskit.BuildmereFactory)(nil) | compile-enforced |
| slog.Error("query failed for user " + userID): interpolated message swallows structure | sloglint: static-msg · apps/api/.golangci.yml | lint-error |
| slog.Error("query failed", err, "athlete_id", id): positional arg mixed with key-value pairs | sloglint: no-mixed-args · apps/api/.golangci.yml | lint-error |
| slog.Error(...) called while ctx context.Context is in scope | sloglint: context: scope · apps/api/.golangci.yml | lint-error |

One exclusion worth naming: internal/programs is excluded from all linters pending its Connect-RPC rewrite — TODO(programs-rpc) — so the three sloglint rules do not fire there today.

For config, the check this removes is the onboarding-and-2am one: which environment variables does this service actually need, and which are required where? That is now the manifest, enforced at boot and in CI. Nobody greps for os.Getenv to reconstruct the answer.

For metrics, it removes the silent-dashboard check: does the name I emit match the name I graph, and did I pass the right labels? The name is written once and the call is typed, so the emit side cannot drift from the declared signal.

In both cases the edge of the service is described in one declarative place, and the generated code plus the gate make the description binding.

That closes the vocabularies arc — the one source → many projections → drift gates shape applied to data, failures, environment, and telemetry. Part 7: Models at Boundaries zooms out from closed value sets to whole models, and the discipline that keeps a “project” or a “workout” from becoming one overloaded struct smeared across the database, the API, and the frontend.