AGENTS.md
Guidance for AI coding agents working in this benchmark repository.
This repo uses Oak
Use oak, not Git, for version control in this repository.
oak status
oak diff
oak commit # local checkpoint only; does not publish
oak desc --file /tmp/oak-branch-desc.txt
oak push --repo oak/benchmarks
Do not mutate canonical checkouts such as /Users/mrmrs/o/oak,
/Users/mrmrs/o/benchmarks, or /Users/mrmrs/o/oakspace. Work in an isolated
worker-mrmrs-* checkout and report the full worker path with every branch,
commit, and validation result.
Working On The Harness
- Shared measurement policy (subjects, env, tokens, row contract, command
semantics, stream adapters, reporting math) lives in
scripts/oakbench/. Change it there, once; never copy a policy into a lane script.
- Read CONTEXT.md for vocabulary and docs/adr/ for decisions before proposing structural changes.
- Run the instrument tests before pushing: python3 -m unittest discover -s tests.
- Null means unmeasured (ADR-0002): never emit a fabricated zero for a signal the runner did not observe.
- Scenario and operation names are immutable identity (ADR-0005): changed semantics require new names or a version bump.
- Before running remote-backed lanes, source the repo-local remote config:
. config/bench-env.sh. Do not rely on ~/.zshrc; non-interactive agent shells often do not load it.
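As a rough illustration of the "null means unmeasured" rule, the sketch below keeps an unobserved signal as None so it serializes to JSON null and is excluded from averages. The row shape, the field names wall_ms and tokens, the scenario name example.status, and both helper functions are hypothetical; the real row contract lives in scripts/oakbench/.

```python
import json
from typing import Optional


def make_row(scenario: str, wall_ms: Optional[float], tokens: Optional[int]) -> dict:
    """Build a result row; None marks a signal the runner did not observe.

    Hypothetical row shape for illustration only.
    """
    return {"scenario": scenario, "wall_ms": wall_ms, "tokens": tokens}


def mean_observed(rows: list, field: str) -> Optional[float]:
    """Average a field over the rows that measured it; None if no row did."""
    values = [row[field] for row in rows if row[field] is not None]
    return sum(values) / len(values) if values else None


rows = [
    make_row("example.status", wall_ms=12.0, tokens=None),  # tokens not observed
    make_row("example.status", wall_ms=14.0, tokens=None),
]

print(json.dumps(rows[0]))             # the unobserved field serializes as null
print(mean_observed(rows, "wall_ms"))  # 13.0
print(mean_observed(rows, "tokens"))   # None, never a fabricated 0
```

Writing 0 for tokens here would pull any downstream average toward zero and make an unmeasured lane look like a free one.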
Finishing A Task
Before handing work back, leave the repo in a reviewable, server-visible state:
oak status
python3 -m unittest discover -s tests
oak commit
oak desc --file /tmp/oak-branch-desc.txt
oak push --repo oak/benchmarks
oak commit is local-only. Use oak push after validation so reviewers can
see the branch. Do not merge your own branch unless the user explicitly asks
for a safe-merge/land flow.
Improving Oak Against These Benchmarks (swarm runbook)
You are probably here to make Oak (../oak) faster and cheaper for agents.
The loop:
- python3 scripts/opportunities.py: the ranked attack list generated from the latest results. Pick a target from the top; don't re-derive priorities from raw JSONL.
- Edit Oak in ../oak, build it (cargo build --release there).
- python3 scripts/devloop.py: one command, one verdict (PASS or REGRESSED) for your changeset vs the Oak baseline.
- Only claim a win that devloop's printed detection limit supports.
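The last step of the loop can be sketched as a simple comparison. This is a minimal sketch, assuming the detection limit is expressed as a percentage of the baseline and that lower values are better; the function name and its parameters are hypothetical and are not part of devloop.

```python
def claim_is_supported(baseline: float, candidate: float, detection_limit_pct: float) -> bool:
    """Return True only when the relative improvement exceeds the detection limit.

    Assumes lower is better (time, tokens) and a limit given in percent.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    improvement_pct = (baseline - candidate) / baseline * 100.0
    return improvement_pct > detection_limit_pct


# A 3% improvement measured against a 10% detection limit is noise.
print(claim_is_supported(100.0, 97.0, detection_limit_pct=10.0))  # False
# The same 3% improvement is claimable once the limit drops to 2%.
print(claim_is_supported(100.0, 97.0, detection_limit_pct=2.0))   # True
```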
Claim discipline (non-negotiable):
- Never claim a delta below the noise floor. devloop prints its own
per-lane detection limits; the default --runs 2 can only catch large changes.
Chasing a 3% win? Bump --runs until the limit is below the effect you claim, or the claim is noise.
- A cheaper output that loses information is a regression, not a win.
The gates enforce this: information_recall < baseline or lost pipe_compatible_unified
structure fails the verdict (RECALL / PIPE-COMPAT lines). Don't try to win tokens by omitting changed-file names or unified-diff hunks.
- Never aggregate across instruction levels or transports.
remote.net.* rows (real server, network) are oak-vs-previous-oak trend evidence only; they are deliberately named differently from git's local-file remote.* ops and must never be compared against them.
- Measurement serializes; building parallelizes. Lanes take a
cross-process measurement lock so concurrent agents don't poison each
other's timings. If your run waits on the lock, that's correct behavior;
do your editing/building while waiting, and never set
OAK_BENCH_LOCK=off on a shared machine.
- Run with your own results dir when working in parallel
(--results results/<your-task-slug>), and leave results/latest.* to serialized verdict runs.
- Skips are work items, not noise. A returncode-77 row names exactly what infrastructure would unlock a measurement (see the "Unmeasured" section of opportunities.py). Unlocking coverage is as valuable as improving a number.
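The information-loss rule above can be pictured as a gate that ignores token savings until recall and structure hold. The sketch below is illustrative only: information_recall and pipe_compatible_unified are the signal names used above, but the dictionary layout, the tokens field, the function name, and the returned strings are assumptions, not devloop's actual implementation or output.

```python
def gate_verdict(baseline: dict, candidate: dict) -> str:
    """Fail the verdict when the candidate loses information, however cheap it is.

    Hypothetical sketch: each dict is assumed to carry information_recall (float)
    and pipe_compatible_unified (bool).
    """
    failures = []
    if candidate["information_recall"] < baseline["information_recall"]:
        failures.append("RECALL")
    if baseline["pipe_compatible_unified"] and not candidate["pipe_compatible_unified"]:
        failures.append("PIPE-COMPAT")
    if failures:
        return "REGRESSED: " + ", ".join(failures)
    return "PASS"


baseline = {"tokens": 400, "information_recall": 1.0, "pipe_compatible_unified": True}

# Fewer tokens, but changed-file names were dropped, so recall fell.
lossy = {"tokens": 250, "information_recall": 0.8, "pipe_compatible_unified": True}
print(gate_verdict(baseline, lossy))    # REGRESSED: RECALL

# Fewer tokens with recall and unified-diff structure intact.
compact = {"tokens": 320, "information_recall": 1.0, "pipe_compatible_unified": True}
print(gate_verdict(baseline, compact))  # PASS
```

The token count never enters the gate: a candidate has to clear it before its savings count as a win.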
Benchmark Data Hygiene
- Do not commit raw real-agent transcripts, generated result JSONL, run artifacts, or temporary workspaces.
- Keep benchmark scenarios, scripts, configs, dashboard schema, and docs in the repo.
- Put large/generated benchmark outputs in external storage or ignored local directories.
- Treat --agent-environment minimal as the default for publishable token comparisons; use local-default only when intentionally measuring local agent customization overhead.