Log in
CONTEXT.md 33 lines · 4.0 KB

Domain Glossary

The vocabulary this repository uses for benchmark design, code, docs, and reports. Use these terms exactly; do not introduce synonyms.

TermMeaning
SubjectA VCS binary under measurement: git, oak_installed (released Oak), oak_local (current changeset), oak_main, or a tuned-Git variant such as git_fsmonitor. Subjects have two identity layers: logical name for trend grouping, and binary+version+source-hash for provenance.
LaneOne harness with its own scenario style and row shape: core (bench.py, isolated VCS operations), workflow (workflow_ab.py, deterministic scripted recipes), agent (agent_workflow.py, real coding-agent CLIs), contention (parallel_contention.py, N concurrent workers), mount (mount_probe.py, lazy hydration).
ScenarioA named, versioned workload shape within a lane (tiny_text, bugfix_test_loop, contention_shared_checkout_w8). Scenario names are immutable identity: changed semantics require a new name (see ADR-0005).
OperationOne timed unit within a scenario (status.dirty, vcs.snapshot, parallel.total). Same immutability rule as scenarios.
TrackThe command-semantics contract for a run: agent-default (what an agent would call today) or core-equivalent (closest matching semantic output level across subjects). The contract lives in config/command_semantics.json.
RowOne JSONL object for one measured operation. The durable source of truth; summaries and dashboards are derived. Null metric values mean unmeasured, never zero (ADR-0002).
FixtureThe deterministic file tree a scenario runs against. Fixture generation is untimed. Fixture content is versioned (fixture_version); changing bytes requires a version bump.
OracleThe deterministic pass/fail evaluation of a run: validators, changed-file shape, clean final VCS state. The agent's final message is never an oracle.
ProfileA size tier for the core lane: smoke, standard, large.
Instruction levelHow much VCS-specific instruction an agent prompt carries: zero-shot, cheat-sheet, full-docs. Measures the model-familiarity tax. Rows are never aggregated across levels.
TurnOne model inference round-trip in the agent lane. Turns dominate agent cost because each re-pays the transcript as input tokens.
Agent-emitted / agent-ingestedBilling-direction token accounting: text the model emits (commands, prose) is output-priced; text it reads back (tool results) is input-priced. cost_weighted_total combines them.
Admitted outputThe capped portion of command stdout/stderr treated as agent-visible context for token estimates. Full byte counts are always recorded alongside.
Determinism probeA diagnostic row pair checking byte-stability of identical-state output (prompt-cache friendliness) and ANSI pollution in non-TTY output.
Skip rowA row with returncode: 77 and skip_reason, recording that a measurement could not run (missing capability, no remote). Skips are recorded, never hidden.
Measurement sourcePer-metric-group string saying how values were obtained (adapter stream, transcript classification, filesystem probe). Accompanies the null-honesty contract.
BaselineThe comparison anchor: Git baseline (git), Oak baseline (oak_main if present, else oak_installed). The release-blocking comparison is target (oak_local) vs Oak baseline.
Efficiency gateA CI regression threshold on exact metrics (tool calls, output bytes) applied to the Oak-baseline comparison only (ADR-0004).

Architecture vocabulary

The shared measurement primitives live in the oakbench package (scripts/oakbench/): subjects, environment policy, token accounting, timed execution, fixtures, the row contract, stream adapters, command semantics, and reporting math. Lanes are thin scenario definitions over that core. When a measurement policy changes, it changes in oakbench, once.