CONTEXT.md
33 lines · 4.0 KB
Domain Glossary
The vocabulary this repository uses for benchmark design, code, docs, and reports. Use these terms exactly; do not introduce synonyms.
| Term | Meaning |
|---|---|
| Subject | A VCS binary under measurement: git, oak_installed (released Oak), oak_local (current changeset), oak_main, or a tuned-Git variant such as git_fsmonitor. Subjects have two identity layers: logical name for trend grouping, and binary+version+source-hash for provenance. |
| Lane | One harness with its own scenario style and row shape: core (bench.py, isolated VCS operations), workflow (workflow_ab.py, deterministic scripted recipes), agent (agent_workflow.py, real coding-agent CLIs), contention (parallel_contention.py, N concurrent workers), mount (mount_probe.py, lazy hydration). |
| Scenario | A named, versioned workload shape within a lane (tiny_text, bugfix_test_loop, contention_shared_checkout_w8). Scenario names are immutable identity: changed semantics require a new name (see ADR-0005). |
| Operation | One timed unit within a scenario (status.dirty, vcs.snapshot, parallel.total). Same immutability rule as scenarios. |
| Track | The command-semantics contract for a run: agent-default (what an agent would call today) or core-equivalent (closest matching semantic output level across subjects). The contract lives in config/command_semantics.json. |
| Row | One JSONL object for one measured operation. The durable source of truth; summaries and dashboards are derived. Null metric values mean unmeasured, never zero (ADR-0002). |
| Fixture | The deterministic file tree a scenario runs against. Fixture generation is untimed. Fixture content is versioned (fixture_version); changing bytes requires a version bump. |
| Oracle | The deterministic pass/fail evaluation of a run: validators, changed-file shape, clean final VCS state. The agent's final message is never an oracle. |
| Profile | A size tier for the core lane: smoke, standard, large. |
| Instruction level | How much VCS-specific instruction an agent prompt carries: zero-shot, cheat-sheet, full-docs. Measures the model-familiarity tax. Rows are never aggregated across levels. |
| Turn | One model inference round-trip in the agent lane. Turns dominate agent cost because each re-pays the transcript as input tokens. |
| Agent-emitted / agent-ingested | Billing-direction token accounting: text the model emits (commands, prose) is output-priced; text it reads back (tool results) is input-priced. cost_weighted_total combines them. |
| Admitted output | The capped portion of command stdout/stderr treated as agent-visible context for token estimates. Full byte counts are always recorded alongside. |
| Determinism probe | A diagnostic row pair checking byte-stability of identical-state output (prompt-cache friendliness) and ANSI pollution in non-TTY output. |
| Skip row | A row with returncode: 77 and skip_reason, recording that a measurement could not run (missing capability, no remote). Skips are recorded, never hidden. |
| Measurement source | Per-metric-group string saying how values were obtained (adapter stream, transcript classification, filesystem probe). Accompanies the null-honesty contract. |
| Baseline | The comparison anchor: Git baseline (git), Oak baseline (oak_main if present, else oak_installed). The release-blocking comparison is target (oak_local) vs Oak baseline. |
| Efficiency gate | A CI regression threshold on exact metrics (tool calls, output bytes) applied to the Oak-baseline comparison only (ADR-0004). |
Architecture vocabulary
The shared measurement primitives live in the oakbench package
(scripts/oakbench/): subjects, environment policy, token accounting, timed
execution, fixtures, the row contract, stream adapters, command semantics, and
reporting math. Lanes are thin scenario definitions over that core. When a
measurement policy changes, it changes in oakbench, once.