AGENTS.md

Guidance for AI coding agents working in this benchmark repository.

This repo uses Oak

Use oak, not Git, for version control in this repository.

oak status
oak diff
oak commit          # local checkpoint only; does not publish
oak desc --file /tmp/oak-branch-desc.txt
oak push --repo oak/benchmarks

Do not mutate canonical checkouts such as /Users/mrmrs/o/oak, /Users/mrmrs/o/benchmarks, or /Users/mrmrs/o/oakspace. Work in an isolated worker-mrmrs-* checkout and report the full worker path with every branch, commit, and validation result.

Working On The Harness

  • Shared measurement policy (subjects, env, tokens, row contract, command semantics, stream adapters, reporting math) lives in scripts/oakbench/. Change it there, once; never copy a policy into a lane script.
  • Read CONTEXT.md for vocabulary and docs/adr/ for decisions before proposing structural changes.
  • Run the instrument tests before pushing: python3 -m unittest discover -s tests.
  • Null means unmeasured (ADR-0002): never emit a fabricated zero for a signal the runner did not observe.
  • Scenario and operation names are immutable identity (ADR-0005): changed semantics require new names or a version bump.
  • Before running remote-backed lanes, source the repo-local remote config: . config/bench-env.sh. Do not rely on ~/.zshrc; non-interactive agent shells often do not load it.
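
The null-means-unmeasured rule (ADR-0002) can be sketched in a few lines of Python. This is an illustration only: the make_row helper and the field names op, wall_ms, and tokens are hypothetical, not the harness's actual row contract, which lives in scripts/oakbench/.

```python
import json
from typing import Optional


def make_row(op: str, wall_ms: Optional[float], tokens: Optional[int]) -> str:
    """Serialize one result row; signals the runner did not observe stay None."""
    # Hypothetical field names, chosen for illustration only.
    row = {"op": op, "wall_ms": wall_ms, "tokens": tokens}
    # json.dumps writes None as JSON null, so "unmeasured" never becomes 0.
    return json.dumps(row, sort_keys=True)


# The runner timed the operation but did not observe a token count.
line = make_row("status", wall_ms=12.5, tokens=None)
print(line)
```

A reader of the row can then tell "not observed" (null) apart from "observed and equal to zero", which a fabricated 0 would hide.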

Finishing A Task

Before handing work back, leave the repo in a reviewable, server-visible state:

oak status
python3 -m unittest discover -s tests
oak commit
oak desc --file /tmp/oak-branch-desc.txt
oak push --repo oak/benchmarks

oak commit is local-only. Use oak push after validation so reviewers can see the branch. Do not merge your own branch unless the user explicitly asks for a safe-merge/land flow.

Improving Oak Against These Benchmarks (swarm runbook)

You are probably here to make Oak (../oak) faster and cheaper for agents. The loop:

  1. python3 scripts/opportunities.py: the ranked attack list generated from the latest results. Pick a target from the top; don't re-derive priorities from raw JSONL.
  2. Edit Oak in ../oak, build it (cargo build --release there).
  3. python3 scripts/devloop.py: one command, one verdict (PASS or REGRESSED) for your changeset vs the Oak baseline.
  4. Only claim a win that devloop's printed detection limit supports.
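
Step 4 amounts to a simple comparison, sketched below in Python. It assumes the detection limit is expressed as a percentage of the baseline; devloop's actual output format and math are not shown here, and the function name and numbers are hypothetical.

```python
def claim_is_supported(baseline: float, candidate: float, detection_limit_pct: float) -> bool:
    """Return True only if the measured improvement exceeds the detection limit.

    Assumes lower is better (time, tokens) and that the limit is a percentage
    of the baseline. Both are assumptions made for this sketch.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    improvement_pct = (baseline - candidate) / baseline * 100.0
    return improvement_pct > detection_limit_pct


# Illustrative numbers only, not real measurements: a 3% improvement.
print(claim_is_supported(1000.0, 970.0, detection_limit_pct=10.0))  # limit too coarse
print(claim_is_supported(1000.0, 970.0, detection_limit_pct=2.0))   # limit tight enough
```

With a 10% limit the 3% delta is indistinguishable from noise; it only becomes a claimable win once more runs bring the limit below the effect.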

Claim discipline (non-negotiable):

  • Never claim a delta below the noise floor. devloop prints its own per-lane detection limits; the default --runs 2 can only catch large changes. Chasing a 3% win? Bump --runs until the limit is below the effect you claim, or the claim is noise.
  • A cheaper output that loses information is a regression, not a win. The gates enforce this: an information_recall below baseline, or a loss of pipe_compatible_unified structure, fails the verdict (RECALL / PIPE-COMPAT lines). Don't try to win tokens by omitting changed-file names or unified-diff hunks.
  • Never aggregate across instruction levels or transports. remote.net.* rows (real server, network) are oak-vs-previous-oak trend evidence only; they are deliberately named differently from git's local-file remote.* ops and must never be compared against them.
  • Measurement serializes; building parallelizes. Lanes take a cross-process measurement lock so concurrent agents don't poison each other's timings. If your run waits on the lock, that's correct behavior: do your editing and building while waiting, and never set OAK_BENCH_LOCK=off on a shared machine.
  • Run with your own results dir when working in parallel (--results results/<your-task-slug>), and leave results/latest.* to serialized verdict runs.
  • Skips are work items, not noise. A returncode-77 row names exactly what infrastructure would unlock a measurement (see the "Unmeasured" section of opportunities.py). Unlocking coverage is as valuable as improving a number.
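
The information gates described above can be pictured with a minimal Python sketch. It is not devloop's implementation: the gate_failures function and its parameters are hypothetical, and only the RECALL and PIPE-COMPAT labels come from this document.

```python
from typing import List


def gate_failures(
    baseline_recall: float,
    candidate_recall: float,
    baseline_pipe_compatible: bool,
    candidate_pipe_compatible: bool,
) -> List[str]:
    """List the gates a candidate fails, regardless of how many tokens it saves."""
    failures: List[str] = []
    # Recall below the baseline means the output lost information.
    if candidate_recall < baseline_recall:
        failures.append("RECALL")
    # Losing unified-diff structure the baseline had breaks pipe compatibility.
    if baseline_pipe_compatible and not candidate_pipe_compatible:
        failures.append("PIPE-COMPAT")
    return failures


# Illustrative values only: a candidate that drops some changed-file names.
print(gate_failures(1.0, 0.9, True, True))
```

Token savings never enter the check, so a cheaper output cannot buy its way past a lost file name or a missing hunk.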

Benchmark Data Hygiene

  • Do not commit raw real-agent transcripts, generated result JSONL, run artifacts, or temporary workspaces.
  • Keep benchmark scenarios, scripts, configs, dashboard schema, and docs in the repo.
  • Put large/generated benchmark outputs in external storage or ignored local directories.
  • Treat --agent-environment minimal as the default for publishable token comparisons; use local-default only when intentionally measuring local agent customization overhead.