/oak

Reviews

0 open branches

1 merged · 24h

86 merged · all-time

Open

merge when ready

Nothing here.

Merged

shipped recently

Merged 08251959 Refresh benchmarks/README.md as agent-native VCS marketing grounded in real coverage.

mrmrs 3hr ago

Merged dd17777d Fix branch-fleet benchmark oracle and cleanup failure metadata.

mrmrs 1d ago

Merged 313053a7 Updates benchmark repo agent instructions with explicit isolated-worker and finish-publish rules. Why: the stale zdgeier-d8eb3e branch tried to add a commit-and-push rule but was too old to land safely; this reapplies the useful AGENTS.md-only guidance on current main without deleting benchmark harness files.

mrmrs 3d ago

Merged 4a284352 Add launch claim specs for live branch-fleet evidence.

mrmrs 3d ago

Merged 28ec9a1e Update benchmark coverage after M7 mount evidence.

mrmrs 3d ago

Merged 0d4334d3 Add public claim gates that rederive launch numbers from raw JSONL.

mrmrs 3d ago

Merged 956532f0 Repair launch proof harness behavior for the new agent-native contract. Agent-state JSON probes now declare minimum Oak 0.98.0 and emit explicit capability skips for older installed binaries instead of failing required-field checks; branch-fleet and sync live lanes preflight commit --push before setup; mount finish passes an absolute desc-file path; and partial-destination mount recovery is reported as an explicit unsupported skip/finding. Validation: python3 -m unittest discover -s tests; py_compile on changed scripts; core compact rerun exits 0; mount_probe rerun exits 0.

mrmrs 4d ago
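
The minimum-version gate in the entry above (probes declare Oak 0.98.0 and emit a capability skip for older binaries) could be sketched roughly as follows. This is only an illustration: it assumes a plain dotted version string, and `parse_version`, `probe_or_skip`, the `skip_reason` label, and the row fields are hypothetical names, not the harness's own.

```python
MIN_OAK_VERSION = (0, 98, 0)  # minimum stated in the entry above


def parse_version(text):
    """Parse a dotted version such as '0.96.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def probe_or_skip(installed_version, run_probe):
    """Run the probe only when the binary is new enough.

    Older binaries get an explicit skip row instead of a failed
    required-field check. All field names here are hypothetical.
    """
    if parse_version(installed_version) < MIN_OAK_VERSION:
        return {
            "status": "skipped",
            "skip_reason": "capability_requires_newer_oak",  # hypothetical label
            "installed_version": installed_version,
            "minimum_version": ".".join(str(n) for n in MIN_OAK_VERSION),
        }
    return run_probe()
```

Comparing integer tuples rather than strings keeps a version like 0.100.1 ordered after 0.98.0.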

Merged a95a262a Accelerate branch_fleet live setup with bounded parallel seed and cleanup while preserving one fleet.seed row and the measured classify/plan/apply/sync/oracle workflow. Records worker counts, global failure indices, and cleanup diagnostics so n100 validation is practical without changing the oracle contract.

mrmrs 4d ago

Merged 1d8dcb61 Harden and accelerate branch_fleet live seed in oak/benchmarks. The Oak live seed path now collapses per-branch commit+push into explicit oak commit --push, emits per-branch and conflict-advancer progress, records seed diagnostics/failure metadata, retries transient 502/503/504 seed failures, wraps timeouts as failed seed rows, scopes cleanup to remote-visible disposable branches, and keeps classify/plan/apply/sync/oracle skipped after partial setup.

mrmrs 4d ago
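
The retry behavior described above (transient 502/503/504 seed failures are retried, and a failure ends up as a failed seed row rather than a crash) might look something like this minimal sketch. The function name, the `(ok, status)` return shape, the attempt limit, and the exponential backoff are all assumptions made for illustration.

```python
import time

TRANSIENT_STATUSES = {502, 503, 504}  # statuses named in the entry above


def seed_with_retry(seed_once, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call seed_once() until it succeeds or fails non-transiently.

    seed_once is a hypothetical callable returning (ok, http_status).
    The result is always a row dict, so a failed seed is recorded
    instead of raised.
    """
    attempts = 0
    status = None
    while attempts < max_attempts:
        attempts += 1
        ok, status = seed_once()
        if ok:
            return {"seed_ok": True, "attempts": attempts, "last_status": status}
        if status not in TRANSIENT_STATUSES:
            break  # permanent failure: retrying would not help
        if attempts < max_attempts:
            sleep(base_delay_s * 2 ** (attempts - 1))  # assumed backoff policy
    return {"seed_ok": False, "attempts": attempts, "last_status": status}
```

Passing `sleep` as a parameter lets tests run the retry path without waiting.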

Merged a9d9c0c3 Use checkout-free `oak merge <branch>` in branch-fleet apply.

mrmrs 5d ago

Merged 09bf4468 Keep full machine-parse stdout for branch-fleet classification.

mrmrs 5d ago

Merged ec7a90bf Update benchmark semantics for Oak's explicit publish contract.

mrmrs 5d ago

Merged 3c10ba56 Fix 16 benchmark harness correctness bugs

mrmrs 5d ago

Merged 137d0916 Build and harden the propose-mode overnight perf-improvement loop harness for oak/benchmarks.

mrmrs 5d ago

Merged b3c75a34 Benchmark summary transport guard: workflow A/B summaries now suppress elapsed-time deltas when successful workflow.total rows use different remote/workspace transports (for example Git local_file vs Oak network sync workflows), while still showing each subject's raw elapsed ms and preserving token/tool-call deltas. Adds regression coverage so transport-dependent sync rows cannot be reported as misleading Oak-vs-Git speed deltas.

mrmrs 5d ago
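
A rough sketch of the transport guard described above: the elapsed-time delta is suppressed when the two rows used different transports, while raw elapsed values and token deltas are still reported. The row fields (`transport`, `elapsed_ms`, `tokens`) and function names are hypothetical and chosen only for this example.

```python
def elapsed_delta_ms(row_a, row_b):
    """Return b - a in elapsed ms, or None when the transports differ."""
    if row_a["transport"] != row_b["transport"]:
        return None  # a cross-transport speed delta would be misleading
    return row_b["elapsed_ms"] - row_a["elapsed_ms"]


def summarize(row_a, row_b):
    """Build one A/B summary row; field names are hypothetical."""
    return {
        "a_elapsed_ms": row_a["elapsed_ms"],  # raw values are always shown
        "b_elapsed_ms": row_b["elapsed_ms"],
        "elapsed_delta_ms": elapsed_delta_ms(row_a, row_b),
        "token_delta": row_b["tokens"] - row_a["tokens"],  # preserved either way
    }
```

Returning `None` rather than omitting the key keeps the suppression visible in the summary.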

Merged 1e6c31fa Expose branch-fleet workflow-only timing and add a read-only Oak field-map helper.

mrmrs 5d ago

Merged 53fed5b6 Make devloop exercise Oak clone/push storm as part of the agent-scale reliability gate.

mrmrs 5d ago

Merged 97eab9b8 Add branch_fleet_nN platform benchmarks for agent branch-fleet workflows, with failure-mode hardening.

mrmrs 6d ago

Merged 9cbec33c Repair benchmark harness preflights, fixture limits, and failure diagnostics

mrmrs 6d ago

Merged ece1c3e8 bench: branch-triage fixture v2 labels and risk metrics

mrmrs 6d ago

Merged 34da23f1 Add result delta report helper

mrmrs 6d ago

Merged 57a43536 Add branch-triage-shape fixture, metrics, and lane tracer bullet

mrmrs 7d ago

Merged f48e0d3b Measure compact Oak agent state JSON

mrmrs 8d ago

Merged 9c6460ae Refine agent-native JSON probe: separate token accounting from JSON validation. run_timed gains opt-in full_output_bytes that captures stdout_full_text alongside the normal admitted (capped) stdout_text. The JSON probe now measures tokens/bytes on the admitted --admitted-output-chars window (consistent with every other op, fixing the prior interim fix that measured JSON probes on the full window) while validating the JSON oracle against the full capture (so a >20k branch review/diff JSON parses whole — the original truncation false-failure stays fixed). Full capture is bounded at 64MB (checked before reading: no OOM); over the bound, the probe is honestly UNMEASURED (json_validation_source null, json_* null, json_validation_unmeasured_reason set, returncode stays 0) rather than a false parse-failure — the final gate is 'json_oracle_passed is False' so unmeasured never trips returncode 1. Rows stamp json_validation_source (full_stdout_capture | admitted_stdout_capture | null). Two regression tests lock it: admitted-truncated-but-fully-validated keeps token accounting on the admitted window; over-cap full capture is unmeasured not failed. 752 tests green. Supersedes the interim 8f793b35 by restoring uniform token accounting.

mrmrs 8d ago
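
The split described above, with token accounting on the admitted window and JSON validation on the full capture, can be approximated as below. This is a simplified sketch, not the harness code: it takes the full stdout as an in-memory string, whereas the entry says the 64MB bound is checked before reading. The function names and the unmeasured-reason label are assumptions; the `json_*` field names follow the entry.

```python
import json

ADMITTED_CHARS = 20_000                    # the 20k admitted window
FULL_CAPTURE_MAX_BYTES = 64 * 1024 * 1024  # the 64MB full-capture bound


def json_probe_row(stdout_full_text, admitted_chars=ADMITTED_CHARS,
                   full_max_bytes=FULL_CAPTURE_MAX_BYTES):
    """Measure size on the admitted window, validate JSON on the full text."""
    admitted = stdout_full_text[:admitted_chars]
    row = {
        "admitted_bytes": len(admitted.encode("utf-8")),
        "json_validation_source": None,
        "json_oracle_passed": None,
        "json_validation_unmeasured_reason": None,
    }
    if len(stdout_full_text.encode("utf-8")) > full_max_bytes:
        # Over the bound: unmeasured, not a parse failure.
        row["json_validation_unmeasured_reason"] = "full_capture_over_bound"
        return row
    row["json_validation_source"] = "full_stdout_capture"
    try:
        json.loads(stdout_full_text)
        row["json_oracle_passed"] = True
    except json.JSONDecodeError:
        row["json_oracle_passed"] = False
    return row


def probe_returncode(row):
    """Only an explicit False trips returncode 1; unmeasured (None) stays 0."""
    return 1 if row["json_oracle_passed"] is False else 0
```

The `is False` comparison is what keeps an unmeasured row from being reported as a failure.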

Merged 8f793b35 Fix agent-native JSON probe truncating large oak JSON into false parse-failure rows: the probe captured with the 20k --admitted-output-chars window, so oak branch review/diff --json (~67KB on high-cardinality branches) was truncated mid-document and the oracle recorded an oak parse failure even though oak emitted complete valid JSON (hit both oak_installed and oak_local in standard runs). Now capability JSON is captured whole via INFO_PROBE_MAX_CHARS like the info probe, and if output ever exceeds even that window the probe emits an honest output_exceeded_capture_window skip (rc 77) instead of parsing truncated bytes and blaming the subject. General command runner unchanged (admitted-output cap is its intended byte-measurement policy). Two regression tests: >20k JSON now parses + captures with the full window; truncated output is an honest skip not a parse failure. 750 tests green.

mrmrs 8d ago

Merged ec21b5d3 Harden fake-provider publish gate and conflict oracle validation: publish_gate now rejects any driver=='fake-provider' or branch_triage_provider=='fake' row from publishable inputs independent of profile (enforced guard, not relying on the platform-not-public-core boundary); conflict_resolution_lane validates oracle schema (expected_conflict/conflicted_paths/resolution/exactly-one-content-form/post_merge_check) and emits honest returncode-77 skip rows on malformed oracles instead of crashing; text resolution content pinned as exact UTF-8 bytes with no newline normalization (documented contract + CRLF-preserving test). Regression tests for all three. 748 tests green.

mrmrs 8d ago
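
As an illustration of the fake-provider guard above, a publish filter could reject rows on the two stated conditions regardless of profile. The field names `driver` and `branch_triage_provider` come from the entry; the function names and the accepted/rejected split are assumptions for this sketch.

```python
def is_publishable(row):
    """Reject fake-provider rows independent of any profile setting."""
    if row.get("driver") == "fake-provider":
        return False
    if row.get("branch_triage_provider") == "fake":
        return False
    return True


def split_publishable(rows):
    """Partition rows so rejected inputs stay countable, never silently dropped."""
    accepted = [row for row in rows if is_publishable(row)]
    rejected = [row for row in rows if not is_publishable(row)]
    return accepted, rejected
```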

Merged 98107fc8 Close scale-lane benchmark coverage gaps for LFS, netshape TTFD, and monorepo fixture status

mrmrs 8d ago

Merged 691cef97 Wire agent workflow VCS shim sidecar thrash metrics with null reasons

mrmrs 8d ago

Merged b582f31d Add conflict resolution workflow lane

mrmrs 8d ago

Merged 58b666fe Add minimal branch triage platform lane

mrmrs 8d ago

Merged d18c9c59 Harden platform lane MVP semantics rows and comparison keys

mrmrs 8d ago

Merged 16f312fb Verify content-integrity source strength in repro bundles

mrmrs 8d ago

Merged a1c3d8a4 Harden content integrity public-trust gating

mrmrs 8d ago

Merged 6b3c8ffe Aggregate content-attestation payload sources conservatively

mrmrs 8d ago

Merged f33d82ca Add Oak agent-native JSON capability probes

mrmrs 8d ago

Merged eb8b541d Enforce honest cold-cache state in mount lanes

mrmrs 8d ago

Merged d8a5057a Stamp benchmark rows with environment isolation provenance

mrmrs 8d ago

Merged 0e2857c3 Strengthen provenance hash coverage for core lane rows

mrmrs 8d ago

Merged b276e409 Reject negative baseline noise-floor overrides

mrmrs 8d ago

Merged 7e254e81 zdgeier-d8eb3e

zdgeier 9d ago

Merged e75f001e Add canonical benchmark remote env file for agents

mrmrs 9d ago

Merged 06ab0f78 Large-binary/LFS tuned mode + netshape TTFD op, reviewed and hardened before landing. git_lfs mode: applies_to enforced from config/git_modes.json (single source of truth; small-binary scenarios honestly mode_skipped), mode.setup.lfs_install/lfs_track rows mirroring fsmonitor's setup pattern, git_lfs in publish_gate TUNED_GIT_SUBJECTS (setup failures trip the gate; presence not required since LFS coverage is scenario/host-dependent — REQUIRED_TUNED_GIT_SUBJECTS split pinned by test), track-coverage guard failing setup closed on untracked binary extensions (lfs_track_incomplete:<ext> — an untracked comparator is an unfair comparator), mocked success-path test with a git-lfs PATH shim + live skipUnless smoke. netshape_log_follow_cold (renamed pre-publication per ADR-0005 state-in-name convention): byteproxy TTFD fields verified lock-guarded, non-empty-chunk-only, reset per run; rows carry ttfd_ms/ttfd_source/first_payload_bytes_server_to_client as explicit nulls when unobserved; ttfd_semantics label cold_network_acquire_then_file_history_query on rows. 695 tests green (2 skips: stress-ng, live git-lfs).

mrmrs 10d ago

Merged 81528a03 GitGoodBench importer: accept the real HuggingFace CSV shape. The raw Lite CSV encodes the scenario column as a Python-literal dict string (pandas repr), not JSON, and carries an unnamed leading index column — parse JSON first then ast.literal_eval (literals only, never code), with BOTH parse paths now accepting only dict/list results so a scalar scenario skips as scenario_json_unparseable instead of going ready. Regression tests for the raw HF shape and the scalar guard; prior-art.md policy row updated to reflect the completed Apache-2.0 license review (clean-room no-code-copy rule retained) and the corrected schema note. Verified against the public Lite CSV first row: ready 1, skipped 0. 676 tests green.

mrmrs 10d ago
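
The two-step parse described above (JSON first, then `ast.literal_eval`, with both paths accepting only dict or list results) can be sketched like this. The function name `parse_scenario` and the `None` return for unparseable input are assumptions; a caller would presumably map `None` to the `scenario_json_unparseable` skip.

```python
import ast
import json


def parse_scenario(raw):
    """Parse a scenario string as JSON, then as a Python literal.

    Returns the parsed dict or list, or None when neither parser
    yields a container. ast.literal_eval evaluates literals only
    and never executes code.
    """
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(raw)
        except (ValueError, SyntaxError, TypeError):
            continue  # try the next parser
        if isinstance(value, (dict, list)):
            return value
        # A scalar result is rejected on both paths.
    return None
```

A scalar such as `"42"` parses successfully under both parsers, but is still rejected because it is not a dict or list.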

Merged 8c56ba02 GitGoodBench: proper source review + citation. prior-art.md gains the full REALM 2025 BibTeX (Lindenbauer/Bogomolov/Zharov, doi:10.18653/v1/2025.realm-1.19) and verified facts replacing secondhand guesses: real 12-column schema with sample_type merge|file_commit_chain, scenario as embedded JSON, HF distribution (900/120/17469 splits, Apache-2.0, 816 repos), upstream harness not runnable (proprietary code removed), their success-rate-only reporting vs our resource-priced gap. Importer modernized to the real schema: canonical sample_type mapping, file_commit_chain expands to interactive_rebase + iterative_commit scenarios, scenario JSON parsed liberally and embedded verbatim as source_scenario, JSONL input support, citation + source_datasets stamped in output (schema_version 2), legacy liberal mapping kept as fallback. 674 tests green.

mrmrs 10d ago

Merged 316e7330 Persist regression_dimensions end-to-end (review fix): comparisons insert binds it as a JSON string, schema_v2 adds the column additively and refreshes dashboard_comparisons to expose it, StubD1 ingest test proves a p90-only regression round-trips as '["p90"]' with is_regression 1; schema.sql + schema_v2.sql verified to apply cleanly against sqlite. 667 tests green + node parity 16/16.

mrmrs 10d ago

Merged eca97060 Fix 8 second-round review findings + 1 sibling: latest.jsonl pointer only advances after row-contract validation (raw file kept for emitter debugging), dashboard is_regression considers p50 OR p90 with regression_dimensions + severity from breaching dimension, re-upload deletes aggregates/comparisons alongside measurements (no stale groups), JS tail honesty mirrors TAIL_MIN_SAMPLES (p95 null under n=20, p99 under n=100, golden-parity asserted both runtimes), all 26 remaining int(row.get(returncode)) call sites across 10 scripts use row_returncode (+ process_returncode sibling in workflow_ab), thrash polling loops emit one in-place-updated event per loop start instead of re-emitting as the window grows, netshape teardown runs after mid-create failure (created flag set before first command, cleanup-then-reraise), netshape exit code computed over subject rows only (self-test row can no longer mask all-skipped). 667 tests green + node parity 15/15.

mrmrs 10d ago
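
The tail-honesty rule above (p95 null under n=20, p99 null under n=100) can be illustrated with a short sketch. It assumes a nearest-rank percentile computed with integer arithmetic; `TAIL_MIN_SAMPLES` is named in the entry, but its shape here, along with the function names and output fields, is hypothetical.

```python
TAIL_MIN_SAMPLES = {95: 20, 99: 100}  # thresholds from the entry above


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling, no float rounding
    return sorted_values[rank - 1]


def tail_latency_summary(samples):
    """Report p50/p95/p99/max, with tails set to None below the sample floor."""
    values = sorted(samples)
    n = len(values)
    if n == 0:
        return {"n": 0, "p50": None, "p95": None, "p99": None, "max": None}
    summary = {"n": n, "p50": nearest_rank(values, 50), "max": values[-1]}
    for pct, minimum in TAIL_MIN_SAMPLES.items():
        key = f"p{pct}"
        summary[key] = nearest_rank(values, pct) if n >= minimum else None
    return summary
```

With ten samples, the nearest-rank p95 lands on rank 10, which is simply the maximum; reporting it as `None` avoids presenting the max under a percentile label.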

Merged 61491464 Fix 10 external-review findings in evidence/verdict layers: repro_verify requires full raw_inputs coverage (incomplete repros cap at COMPARABLE), scorecard reports honest n=contributing-rows + n_successful and caps GREEN/RED at AMBER on partial metric coverage (ADR-0008 updated), fixture verify_fixture three-state status (missing/empty dir fails, manifest-null is unverifiable not verified — reviewer repro now exits 1), baseline campaign pins/measures/versions ONE resolved git binary, dashboard ingest handles subject-less platform rows via effectiveSubject (github/driver identity, skips counted never silent), subject identity keyed on subject_id so mixed binaries never collapse (schema_v2 unique index + documented supersession), comparisons computed against every git-kind baseline with explicit baseline_subject_id + separator-safe keys (order-independent), netshape oak path seeds the served repo before measuring with skip cascade on clone failure, non-race check-instant actually posts+settles status on api driver and honestly skips on cli driver (semantic_contract_match gated on executability), clone_push_storm rows record pushes_succeeded/clones_succeeded as observed zeros so starvation is measurable. 652 tests green + node parity 12/12.

mrmrs 10d ago

Merged aa73b9f7 Document infrastructure plan: control-plane (Cloudflare) vs measurement-plane (bare metal) split, hardware/GitHub/dataset/budget requirements keyed to the exact skip gates in the code, 7-step execution order

mrmrs 10d ago

Merged 4c6fad70 Phase 4: subject-kind plugin interface (config/subject_kinds git/oak/jj with parity_report proving measurement-identical extraction from command_semantics — git/oak parity empty; jj pilot with honest capability-gap nulls), near-declarative scenario_spec loader, lane-contract conformance suite (9 lanes × 5 checks with negative tests + meta-test forcing new lanes to register; reality-vs-plan exemptions discovered by grep and documented), dashboard ingest complete (rows→measurements/subjects upsert/aggregates/comparisons as exported pure functions, schema_v2 additive, scorecard/trends/baseline-book/verifications endpoints, HMAC webhook, Python/JS formula parity via shared golden fixture asserted from both runtimes — node 9/9), stats upgrades (MAD modified-Z flag-never-delete with first-run warmup hint, adaptive rep planning, BH-FDR detail, two-stage confirmation) + statistical-methodology.md, adding-a-subject.md, new-lane-template.md. 624 tests green (1 skip: stress-ng absent).

mrmrs 10d ago
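
The "MAD modified-Z flag-never-delete" idea above can be sketched as follows. The constant 0.6745 and the 3.5 cutoff are the commonly used modified Z-score convention and are assumed here; the entry does not state which values the harness uses, and the function and field names are hypothetical.

```python
import statistics


def flag_outliers(samples, threshold=3.5):
    """Score each sample with a MAD-based modified Z and flag, never delete.

    Every input sample appears in the output; outliers are only marked.
    """
    median = statistics.median(samples)
    mad = statistics.median(abs(x - median) for x in samples)
    rows = []
    for x in samples:
        if mad == 0:
            score = None  # no spread: the score is undefined, so nothing is flagged
        else:
            score = 0.6745 * (x - median) / mad
        rows.append({
            "value": x,
            "modified_z": score,
            "outlier": score is not None and abs(score) > threshold,
        })
    return rows
```

Keeping flagged samples in the output means downstream reports can show results both with and without them.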

Merged ebb35c36 Phase 3: netshape lane (netns/veth/tc-netem profiles with ±15% self-test rows, stdlib smart-HTTP gitserver proven by real clone+push+protocol-v2 tests, TRANSPORT_NETWORK_SHAPED + shaped_comparison_legal, macOS skip honesty), fleet scale (flock-guarded .partial.jsonl crash insurance, merge_train/rebase_storm/mixed_fleet/conflict_storm/clone_push_storm/long_divergence_k modes with queue-wait/attempts-to-land/poll-latency/lost-update accounting, fleet_report saturation knee + fairness + 5xx detection, existing rows byte-identical), stress-ng loadgen (--load-tier on core+workflow, always-timeout/temp-path/seed discipline, bogo-ops verify-only, skip rows when load unapplyable, envwatch boundary samples + environment_suspect on every row) + falsification tests, t/perf external runner (arm's-length GPLv2-clean aggregation, min-of-N sanity band, golden-fixture parser). 551 tests green (1 skip: stress-ng absent).

mrmrs 10d ago

Merged e85f6e6f Phase 2: repro bundles (repro_bundle.py pinning config/scenario/fixture/binary/noise-floor provenance + repro_verify.py with VERIFIED-EXACT/VERIFIED/COMPARABLE/DIVERGENT verdicts, exact machine-independent matching + ratio-CI overlap for latency), bench-conflict-corpus-v1 generator (9 deterministic conflict kinds incl. adjacent-lines clean-merge-broken-build with committed check.py oracle, self-tested per kind), dirty-tree spectrum (status/diff.dirty.p01/.p10/.p50 behind --dirty-spectrum, untimed seeded dirtying with byte-identical restore, flag-off path proven unchanged), VCS PATH shim (flock-safe sidecar JSONL, overhead measured never subtracted, thrash metrics populated: agent_blocked_on_vcs_ms/vcs_share_of_task_wall/thrash_events_count), integration_race_n1/n10 with rate-limit budgeter pacing recorded as race.pacing.injected + check-instant protection variant (harness-posted statuses). 433 tests green.

mrmrs 10d ago

Merged 3242fb8f Platform lane MVP: pr_single_anatomy + poll_until_merged_i10s + capability probe on protection-none, cli and harness-api drivers (api_metrics exact, token fields null by honesty), monotonic settle clock with poll quantization reporting, rate-limit budgeter (pacing recorded never subtracted), platform_semantics.json fairness contract with all six protection variants name-reserved, runner-stamped rows + platform lane row contract, skip-row honesty without credentials (exit 3), lane excluded from devloop. 380 tests green.

mrmrs 10d ago

Merged b2c66f0c Phase 1 remainder: runner discipline (cachectl purge+honest cache_state, envwatch boundary sampling + environment_suspect, runner_calibration micro-suite CLI, runner-discipline.md), token calibration scaffolding (versioned token_calibration.json uncalibrated-v0, oakbench/calibration.py reporting-layer conversion, token_calibration_campaign.py with tiktoken/heuristic methods), devloop --scorecard (informational target-TDD section vs CURRENT book), xl/monorepo fixture registry entries + make_monorepo_fixture.py deterministic fast-import generator, GitGoodBench-Lite importer with mirror-missing skip honesty. 349 tests green.

mrmrs 10d ago

Merged c7f98378 Phase 1 TDD loop: targets.json + Git Baseline Book loader/emitter + scorecard evaluator/CLI + baseline campaign CLI, with review fixes: noise-inflated GREEN/RED bounds per ADR-0008 (no instrument-luck greens), goal margin measured as distance from parity (fixes goal>1), null-returncode rows treated as failures not crashes, bootstrap default unified at 2000, baseline_book CLI auto-stamps runner identity (ADR-0007), shared metric_value/iter_jsonl/number_or_none (dedup), clear goal validation errors. 301 tests green.

mrmrs 10d ago

Merged 27cab567 Phase 0 benchmark foundation: runner identity, fixture registry, shared stats, byteproxy/hyperfine cross-check, thrash helpers, prior-art coverage, and baseline/scorecard ADRs.

mrmrs 10d ago

Merged 983e59ef Fix benchmark readiness accounting and harness gates

mrmrs 10d ago

Merged 77c6fbfd Add public benchmark readiness bundle and stricter publish gates

mrmrs 10d ago

Merged 0653ae77 Fix benchmark accounting and harness correctness bugs

mrmrs 10d ago

Merged 6d6796e0 mrmrs-9e4c0d

mrmrs 10d ago

Merged a015cd02 sync/divergence recovery + mount-vs-clone lanes. sync_* workflows (workflow_ab): push-divergence/pull-upstream/pull-dirty/non-FF-amend with remote_purpose ctx plumbing, per-run disposable branches, workflow skip rows; live-verified both subjects against oak/bench-sync-tmp. Probed contracts encoded in expected returncodes: git push-reject rc1 / pull-unconfigured rc128 / pull --rebase preserves work; oak 0.96 push rc5, suggested pull rc5, pull --force discards commits (redo steps = recovery cost), and oak pull on a dirty tree SILENTLY discards uncommitted edits (rc0, no warning) — verify.local_payload failing is that measurement. error_mentions_recovery_command on failure rows (git amend non-FF error omits --force-with-lease: captured). mount_vs_clone.py acquisition lane: oak_mount vs git full/shallow/blobless/sparse(3-call) on byte-identical mirrors, time-to-first-read + task-scoped hydration delta via diskprobe, skip-row path contract-validated. row_parity normalizes run timestamps + disposable-branch uniquifiers. Mirrors seeded: github oakdotspace/bench-large-mirror main@f736927 (manifest-verified), oak/bench-large-mirror main single commit (mount verify pending: 2 startup timeouts >120s under load). 174 tests green.

mrmrs 10d ago

Merged 4522f806 mrmrs-baf5cf

mrmrs 10d ago

Merged 97594e90 mount-vs-clone live data landed (n=3/variant, SSH remote): git full 28.1s/670MB, shallow 21.1s/574MB, blobless 24.2s/583MB, sparse-task 7.1s/66MB/3calls to first read on the 351MB/28k-file mirror; oak_mount 0/3 — wide-tree 120s mount timeout REPRODUCED ON CLEAN FSKIT SLATE (leaked-mount hypothesis ruled out), P0 product finding for ../oak: mount startup doesn't scale to the repo class the lazy-hydration pitch targets. Coverage doc gates the oak-vs-git acquisition claim until mount comes up; git-only costs citable with network caveat. Raw lane data copied under results/ (ignored).

mrmrs 10d ago

Merged 15f60559 Branch-lifecycle findings: oak close never propagates to the server (verified via fresh-clone round-trips, by-name and as-current); every clone leaves a transient open branch in listings; merge is the only verified server-side branch remover.

mrmrs 10d ago

Merged 2c45d63d Accuracy & insight push, complete: Phase 0 remote plumbing (oakbench/remotes.py, diskprobe.py), agent-fleet lane live for oak (N mounts, time_to_nth_workspace curves, allocated-vs-visible marginal disk, merge phase + integrity; campaign hardening: partial-row flush, timeout-as-error-row), sync/divergence recovery workflows incl. non-FF-amend + sync.recovery_metrics + actionability, task-loop lane (lane 'task-loop'), real-agent adapter fixes (cursor camelCase usage fallback, goldens for claude/codex/cursor), instruction_level_report.py (zero-shot vs cheat-sheet), publish_gate.py (passing on tuned core + agent + task-loop results), plus the other session's mount-vs-clone live data docs, real-agent findings writeup, and COORDINATION.md. 189 tests green at merge.

mrmrs 10d ago

Merged fa5e945c tokens: price the command an agent types, not the harness's binary path

mrmrs 11d ago

Merged 50e81e22 workflow_ab: history lane measures oak's real file-history and pickaxe when the binary has them

mrmrs 11d ago

Merged 54ff4a1b command_semantics: oak core-equivalent diff is 'diff --print' (v2026-06-11.1)

mrmrs 11d ago

Merged b8a66469 bench-diff-contract

mrmrs 11d ago

Merged 10471910 Add prompts/oak-fix-handoff.md: the complete oak-product findings writeup from two days of benchmark instrumentation, shaped to hand to an agent fixing ../oak. Twenty items in four tiers with observed-verbatim evidence and per-item verification commands. P0 error-recovery traps: self-contradictory oak pull divergence message that recommends only the destructive --force path (the safe push-then-fetch path is never mentioned), branch rename can't find the current branch, same-name-collision error gives impossible advice, mount install success delivered as 'error: Server error:' with the wrong extension name (OakFS vs OakFSExtension), empty-repo mount dead-end with no seeding guidance, dirty-mount teardown refusal without printed recovery steps. P1 performance cliffs from first live mount numbers: 1.4-1.7s first-write-into-mount (then 66ms), warm mount no faster than cold ~1.0s, 281ms no-op fetch, push-dominated 1.2s lifecycle iterations with 250-330ms desc roundtrips. P2 output contracts: piped diff 0% unified-compatible (keep the compactness — recall is 1.0 — add structure when not a TTY), status/branch hot-path verbosity vs git, ASCII banner in agent-paid contexts, missing porcelain/name-only/quiet modes. P3: oak must self-report hydration bytes (FSKit allocation reporting makes disk-side measurement impossible; harness already parses 'hydrated/downloaded: N bytes' patterns), no fsck equivalent, history interrogation gaps (show/blame/pickaxe/file-log), oak finish unshipped (benchmark slot ready), CLI-Mount-app version coupling undefined, silent clean-tree commit. Closes with the don't-regress list: 11-16ms in-mount commits, 10x better lock-wait than git under contention, recall 1.0 at 45 bytes/file, one-call snapshots, content-independent mount cost. 
REWRITE both prompts with the CORRECT orientation (previous versions were backwards and broke three live agent runs): agents work IN a clone of oak/oak and run the benchmark harness AS THEY WORK — they are not launched from oak-benchmarks. benchmark-changeset.md: OAK_SRC = the oak clone you are working in (verify Cargo.toml); BENCH = your own sibling clone of oak/benchmarks, created once, run-only, never another agent's checkout; the loop is edit -> cargo build/test -> devloop --skip-build --oak-local-bin with the binary passed EXPLICITLY every time (defaults like ../oak point at nothing in this layout). optimization-orchestrator.md: session workspace contains the orchestrator's own bench clone plus per-worker dirs each holding BOTH an oak/oak clone (edit here) and a bench clone (measure here); orchestrator computes concrete absolute paths at spawn time and injects them into worker prompts — workers never guess paths and never touch a path they didn't receive; orchestrator re-runs every verdict itself from its own bench against the worker's binary before landing.

mrmrs 11d ago

Merged 19185486 Add prompts/optimization-orchestrator.md: the standing prompt for an autonomous optimization orchestrator (karpathy-autoresearch-style loop). It runs a tmux fleet of codex/claude/cursor workers against ../oak: baseline at n=10, generate ranked hypotheses from scripts/opportunities.py, assign file-disjoint experiments, judge each with devloop + both test suites + quality gates, LAND/KILL/ITERATE, append-only lab journal + leaderboard, explore/exploit split, landing protocol (per-win oak commits with lab-note descs, merge only on full green), cleanup duties (leaked mounts), and stop conditions with a final report. Encodes the session-learned worker launch flags and the hard rules: no claims below the printed noise floor, correctness suites must pass, recall/pipe-compat/determinism/integrity losses are regressions regardless of token wins, never compare network vs local-file transports, never tune the harness to flatter a number (harness fixes are separate changesets). Also adds prompts/benchmark-changeset.md: the prompt for an agent that changed ../oak and wants numbers. Key design: git is a CONTROL GROUP, not a moving part — measured once per harness version into results/baselines/git-smoke.jsonl with a staleness sidecar (benchmarks repo hash + host + command_semantics_version), then every experiment runs only oak_installed/oak_local and regression_report.py merges the baseline file with the fresh run to produce git-vs-installed-vs-changeset in one report. Cuts per-experiment measurement time roughly in half and keeps the git anchor statistically stable across a whole research session. Flow verified live: cross-file deltas compute correctly. Includes per-lane commands, the mount --oak-bin caveat, and the runbook claim-discipline rules. 
Orchestrator prompt gains a worker-isolation topology section: per-worker oak CLONES on native disk (each experiment is its own changeset; cargo builds through a FSKit mount are slow, leave mounts dirty, and add mount-layer variance that contaminates verdict attribution), verdicts via devloop --skip-build --oak-local-bin <worker-clone>/target/release/oak, oak-benchmarks checkout shared read-only. Deliberate exception: exactly one explore-lane worker loops INSIDE an oak mount (CARGO_TARGET_DIR outside) to dogfood mount friction into the journal — the swarm-in-mounts product thesis as a recorded experiment, not an unexamined default. FIX: both prompts were hardcoding this machine's absolute paths (/Users/mrmrs/o/oak-benchmarks, /Users/mrmrs/o/oak) — multiple orchestrators launched from different checkouts all converged on ONE checkout, switching its branches under each other and corrupting all their experiments. Prompts are now location-agnostic: BENCH = the repo the agent is launched in (verified by CONTEXT.md + scripts/oakbench presence, stop-and-ask otherwise), OAK_SRC = $OAK_REPO or BENCH/../oak (the harness's own resolution), RESEARCH = session-timestamped sibling dir so concurrent orchestrators never collide on worker clones, and an explicit rule: never assume another checkout's absolute path — crossing into another orchestrator's checkout corrupts both experiments.

mrmrs 11d ago

Merged cf9dba41 Mount lane live — first full run (2026-06-11). Setup completed: Oak Mount app v0.96.0 auto-installed, OakFS FSKit extension enabled in System Settings, disposable repo oak/oak-benchmarks-tmp seeded (README/AGENTS/docs/ + 64MB assets/big.bin) and merged to main — oak mount requires a non-empty default branch. Result: 54/58 operations measured. Scenario fixes in this changeset: oak mount end refuses dirty/unpushed mounts (verified product policy), so first_use_read_edit and status_diff_commit_inside_mount now commit+push before teardown — both scenarios had produced zero measurements before today (teardown was broken-by-construction), treated as scenario bug fix, not a rename; huge_file_path wired to assets/big.bin unlocking huge_file_partial_read. First-ever mount numbers (one host, n=1, treat as low-n): mount.start ~1.0s regardless of repo content (2-file repo and 64MB repo identical — lazy hydration confirmed time-wise), teardown ~0.5s, time-to-first-useful-work 1.07s, status/diff/commit INSIDE a mount 10-12ms (local speed), push from mount ~850ms, task lifecycle iteration 2.5s first then ~1.2s steady-state, first WRITE into a mount 1.4-1.7s then ~66ms (write-path warmup = top optimization target), warm mount.start not faster than cold (caching opportunity), space.clean 445ms and correctly tears down only clean+pushed mounts, 64MB file: first 4KB read in 825ms without pulling the file. Measurement caveat: FSKit reports allocated_tree_bytes ~= full logical size, so disk-side accounting cannot distinguish virtual from hydrated blocks — true bytes_hydrated needs oak self-reporting. interrupted_recovery's rc=6 refusal (mount into a dirty destination) is the recorded behavior, not a harness failure.

mrmrs 11d ago

Merged cd5acc0f Swarm-readiness infrastructure, built by a three-agent worker swarm (codex, claude, cursor-agent in tmux) under orchestration; all lanes file-disjoint, integrated and verified by the orchestrator. (1) MEASUREMENT LOCK (codex): oakbench/runlock.py — cross-process flock serializing benchmark measurement so concurrent agents don't poison each other's timings; auto-releases on process death, 30s progress notes naming the holder, OAK_BENCH_LOCK=off escape hatch (single-benchmark machines only), graceful degradation without fcntl. Integrated into workflow_ab/parallel_contention/mount_probe (codex) and bench.py (orchestrator; fixtures build OUTSIDE the lock — builds parallelize, measurement serializes). Run metadata records measurement_lock_wait_ms + held/disabled, null never fabricated. Live-verified: a bench run waited 3.4s behind a 4s holder. (2) OPPORTUNITIES SCOREBOARD (claude): scripts/opportunities.py turns latest results into the swarm's ranked attack list — where-oak-loses-to-git scored by pct-gap x op-family frequency weight (status/diff/commit/proc.spawn=10), absolute quality bar (pipe-compat, recall, determinism, ANSI), oak-only trend targets (remote.net.*, mount), unmeasured coverage grouped by skip_reason with unlock instructions, low-n flagging throughout; deterministic output, optional JSON; 13 tests. First report immediately surfaced: status.dirty +16.7% output bytes vs git, pipe-compat 0%, 25 mount measurements gated on the Mount app install. (3) PARALLEL HYGIENE (cursor-agent): unique per-process suffix (pid+token_hex) on disposable-remote branch names so same-second runs never collide on the server; ensure_fixture builds into a temp dir and atomically renames with lost-race fallback and marker-mismatch error; ResultsStore latest.* copies are atomic (temp + os.replace); 4 tests. 
(4) SWARM RUNBOOK (orchestrator): AGENTS.md section 'Improving Oak Against These Benchmarks' — the loop (opportunities -> edit ../oak -> devloop verdict) and claim discipline: no claims below the printed noise floor, recall/pipe-compat losses are regressions, never compare network vs local-file transports, measurement serializes while builds parallelize, per-agent results dirs, skips are work items. 131 tests passing. The swarm is launchable; remaining unlock is the Oak Mount app for v0.95.0 on this host.
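The measurement lock from item (1) can be pictured with the sketch below. It is not the contents of oakbench/runlock.py: the function name `measurement_lock`, the lock-file path, and the keys of the returned dictionary are assumptions made for illustration. Only the flock-based serialization, the OAK_BENCH_LOCK=off escape hatch, and the degradation when fcntl is missing come from the description above.

```python
import contextlib
import os
import tempfile
import time

try:
    import fcntl  # POSIX only; absent on some platforms
except ImportError:
    fcntl = None


@contextlib.contextmanager
def measurement_lock(path, env=None):
    """Hold an exclusive cross-process lock while the body runs.

    Hypothetical helper: yields a dict with held/disabled/wait_ms so the
    caller can record lock state in run metadata. wait_ms stays None when
    no lock was taken, so a wait is never fabricated.
    """
    env = os.environ if env is None else env
    if env.get("OAK_BENCH_LOCK") == "off" or fcntl is None:
        # Escape hatch or missing fcntl: run unserialized and say so.
        yield {"held": False, "disabled": True, "wait_ms": None}
        return
    started = time.monotonic()
    with open(path, "a+") as handle:
        # Blocks until no other process holds the lock. The kernel drops
        # the lock when the holding process exits, even after a crash.
        fcntl.flock(handle, fcntl.LOCK_EX)
        wait_ms = (time.monotonic() - started) * 1000.0
        try:
            yield {"held": True, "disabled": False, "wait_ms": wait_ms}
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "bench.lock")
    with measurement_lock(lock_path) as state:
        print(state)  # timed benchmark commands would run here
```

Building fixtures before entering the `with` block keeps builds parallel while only the timed section serializes.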

mrmrs 11d ago

Merged a5f991c6 Reviewer-gap instruments: close the eight under-measured areas from the external agent review. (1) Memory: peak_rss_bytes on every timed command via os.wait4 rusage capture in oakbench/execution.py (no wrapper process in the timed path; bytes on all platforms; null where wait4 is unavailable, never zero). (2) True token cost: tool-call envelope accounting in oakbench/tokens.py — a call is never just command text; additive estimated_tokens_envelope_* and *_with_envelope fields price the tool-use/tool-result JSON framing (45 emitted / 27 ingested per call), constants validated by the new token_calibration.py --envelope mode against Anthropic/OpenAI templates. Historical token fields are byte-identical (row parity, ADR-0005). This makes git 2-call snapshots vs oak 1-call carry their real price. (3) Tail latency: tail_latency_summary (p50/p95/p99/max) in oakbench/reporting.py with sample honesty — p95 null below n=20, p99 below n=100, because nearest-rank below 1/(1-p) samples is the max relabelled; benchmark_stats.py grew p95/p99/max columns with the same rule. (4) Semantic value of output: *.inforecall probe rows in bench.py score status/diff output against the fixture ground-truth changed set — information_recall, bytes_per_changed_file_named; compact-but-lossy output now shows as recall < 1.0 instead of a token win. New summary section renders it. (5) Pipe compatibility: diff.full.inforecall always runs the core-equivalent full diff and records pipe_compatible_unified + hunk/file-header/binary-notice counts (oakbench/output_semantics.py). FINDING: oak full diff is 0% pipe-compatible vs git 100% — the compat risk is now a tracked metric. Counter-finding: oak recall stays 1.0 at ~45 bytes/file vs git diff.full ~8.4MB on binary scenarios — oak compactness is not lossy, and the large-file delta is now citable. 
(6) Remote/cold cache: remote.push.first / remote.clone.cold / remote.pull.uptodate core-lane ops against a local file:// bare remote (cache_state recorded per row; network jitter excluded by construction); oak emits returncode-77 skip rows until OAK_BENCH_REMOTE wiring lands; --skip-remote for large profiles; skips surface in a new summary section and never fail the run. (7) Mount lifecycle: mount_probe.py handlers for oak.desc, space.clean, and oak.finish (capability-gated via untimed --help probe — skip row on 0.95 binaries, lights up when oak ships finish), plus loop.push_desc (edit/commit/desc/push iterations with per-iteration rows and a .total row) and the task_lifecycle_loop scenario; note the hand-rolled spec parser rejects YAML folded scalars. (8) Compact-output behavior cost: vcs_info_followup_calls_total in the agent lane — a successful read-only VCS command immediately re-asked in the same family (status then status --long) means the first output was insufficient; claude-style streams resolve tool_use_id -> command via a pre-pass map; metric is null when commands are not adapter-visible (ADR-0002). (9) Contention tail: per-commit latency samples and attempts retained raw on worker rows; parallel.total carries commit_latency_tail + retry_depth_histogram (first run: 4 workers, only 8/24 git commits succeeded first attempt, p95 2.6x median). Tests: 90 passing (22 new in tests/test_new_instruments.py covering envelope math, tail honesty, recall/compat oracles, RSS capture, follow-up detection incl. unmeasured gating). Docs: README agent-efficiency + remote + contention sections, benchmark-coverage.md gains 9 evidence rows with public-claim rules, mount-benchmarks.md lifecycle section. Unblocked by config, not code: set OAK_BENCH_REMOTE / OAK_BENCH_MOUNT_REPO at a disposable repo and the oak remote + lifecycle rows start measuring. 
(14) Round-3 review nits: bytes_read_by_agent no longer fakes file reads with transcript output bytes — nullable in schema, null in this lane (the transcript number stays in command_output_bytes); vcs_metrics_from_tools now derives commands_total AND all family sub-counts from one command list through the shared oakbench.classify.vcs_subcommand (sub-counts can never disagree with the total again; new consistency tests); core-lane rows carry measurement_source. 109 tests passing. (15) Disposable remote wired (oak/oak-benchmarks-tmp): mount.yaml points at it with safe_push enabled (it exists to receive benchmark pushes — never point this at a repo whose history matters), and the core lane now measures REAL Oak server remote ops when OAK_BENCH_REMOTE is set — remote.net.push.first / remote.net.clone.cold / remote.net.fetch.uptodate. Distinct names from git's local-file remote.* ops on purpose (ADR-0005): network vs local-file transport are different measurements, so no report can fabricate a git-vs-oak delta between them; the oak rows answer the release-blocking oak-vs-previous-oak trend. Each run pushes a unique bench-<id> branch (untimed setup) because the server rejects same-named branches with unrelated histories — verified by reproducing the agent-task collision. First live numbers (micro, one host): push 446ms, cold clone 694ms, no-op fetch 281ms. Mount lane remains blocked on this machine: oak.space serves no Oak Mount app for v0.95.0 (HTTP 404) — install via 'make macos-app INSTALL=1' in the oak repo or fix the server publication; mount rows record the failure honestly and skip dependents. The disposable repo accumulates bench branches by design; clean it periodically or recreate it.
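The sample-honesty rule from item (3) is easy to check by hand. The sketch below is an illustration, not the code in oakbench/reporting.py: the function names and the returned key set are assumptions, and only the thresholds (p95 null below n=20, p99 null below n=100) and the nearest-rank method come from the entry above.

```python
def nearest_rank(ordered, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer 1..100."""
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def tail_latency_summary(samples):
    """Summarize latencies, returning None where n is too small to be honest."""
    if not samples:
        return {"n": 0, "p50": None, "p95": None, "p99": None, "max": None}
    ordered = sorted(samples)
    n = len(ordered)
    return {
        "n": n,
        "p50": nearest_rank(ordered, 50),
        # Below 20 samples the nearest-rank p95 is just the maximum.
        "p95": nearest_rank(ordered, 95) if n >= 20 else None,
        # Below 100 samples the nearest-rank p99 is just the maximum.
        "p99": nearest_rank(ordered, 99) if n >= 100 else None,
        "max": ordered[-1],
    }


if __name__ == "__main__":
    print(tail_latency_summary(list(range(1, 11))))
    print(tail_latency_summary(list(range(1, 101))))
```

With 19 samples, `nearest_rank(ordered, 95)` selects rank 19, which is the largest sample. That is the "max relabelled" case the rule guards against.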

mrmrs 11d ago

Merged baca0e47 Two commits. (1) Architecture: extract the oakbench measurement core (scripts/oakbench/: subjects, environment, timed execution, token accounting, fixtures, per-lane row contract validated at write time, command-semantics contract as auditable data, stream-adapter seam, reporting math); lanes become thin scenario definitions, verified measurement-identical via row-level diffs; instrument test suite under tests/; CONTEXT.md glossary + ADRs 0001-0005; compare.py deleted. (2) The agent dev loop: scripts/devloop.py — one command, one verdict (PASS/REGRESSED exit code) for 'did this Oak changeset make things worse'; builds ../oak via cargo, measures the host noise floor with an A/A null test plus per-lane embedded null controls from identical-command steps, gates latency above measured noise, exact-metric efficiency gates vs the Oak baseline (ADR-0004), and fails only NEW Git-guardrail breaches (pre-existing gaps tracked, not failing). scripts/row_parity.py codifies ADR-0005 measurement-identity proofs for harness refactors. bench micro profile + per-lane integration tests (core/contention/mock-agent). Real-adapter verification: ran actual claude 2.1.170 and codex-cli 0.139 bugfix runs; found and fixed two reality gaps — claude stream-json requires --verbose (adapter never worked against the live CLI), and codex turn.completed is session-scoped with cache included in input_tokens (adapter now counts round-trips from agent_message items, never double-counts cache, reports peak context as unmeasured). Sanitized real transcripts promoted to tests/fixtures/*_real.jsonl and pinned by tests (68 total). docs/adding-benchmarks.md playbook; README dev-loop section; coverage table updated (cursor adapter marked unverified).

mrmrs 12d ago

Merged 25ec9ce8 Measurement-accuracy overhaul: schema v2 null-honesty contract (unmeasured = null + measurement_source, never fabricated zeros); turn metrics + per-turn timeline parsed from agent streams; instruction-level familiarity lane (zero-shot/cheat-sheet/full-docs); billing-direction cost-weighted tokens; tuned git modes (untracked-cache/split-index/fsmonitor); output determinism + ANSI probes; proc.spawn overhead rows; history_archaeology + vcs_error_recovery workflows; tokens/turns-to-recovery; parallel contention lane (lock wait, throughput, lost updates, fsck integrity); token_calibration tool for char/4 bias; efficiency regression gates on exact metrics. Post-review fixes: claude/codex turn attribution, read-only oracle shapes, prepare steps in agent lane, grep VCS-dir exclusion, snapshot call accounting, worker exception guard, fsmonitor daemon teardown.

mrmrs 12d ago

Merged 21d5470d mrmrs-01bff3

mrmrs 12d ago

Merged ef77578d Initial Oak benchmark harness

mrmrs 12d ago

Closed

closed without merging

Closed zdgeier-d8eb3e Add a 'Finishing a task — commit and push, every time' rule to AGENTS.md so agents commit and push on task completion (after the instrument tests, no merge) instead of leaving local-only checkpoints.

zdgeier 3d ago

Closed mrmrs-1e5214 Repair benchmark harness preflights, fixture limits, and failure diagnostics

mrmrs 6d ago

Closed mrmrs-e0f0b8 Harden the branch-triage benchmark lane's command provenance and scoring gates.

mrmrs 7d ago

Closed mrmrs-8a4e56 Validate JSON probes from full stdout capture

mrmrs 8d ago

Closed mrmrs-05258e 50d00e76

mrmrs 10d ago

Closed bench-phase0-fleet-sync sync/divergence recovery + mount-vs-clone lanes. sync_* workflows (workflow_ab): push-divergence/pull-upstream/pull-dirty/non-FF-amend with remote_purpose ctx plumbing, per-run disposable branches, workflow skip rows; live-verified both subjects against oak/bench-sync-tmp. Probed contracts encoded in expected returncodes: git push-reject rc1 / pull-unconfigured rc128 / pull --rebase preserves work; oak 0.96 push rc5, suggested pull rc5, pull --force discards commits (redo steps = recovery cost), and oak pull on a dirty tree SILENTLY discards uncommitted edits (rc0, no warning) — verify.local_payload failing is that measurement. error_mentions_recovery_command on failure rows (git amend non-FF error omits --force-with-lease: captured). mount_vs_clone.py acquisition lane: oak_mount vs git full/shallow/blobless/sparse(3-call) on byte-identical mirrors, time-to-first-read + task-scoped hydration delta via diskprobe, skip-row path contract-validated. row_parity normalizes run timestamps + disposable-branch uniquifiers. Mirrors seeded: github oakdotspace/bench-large-mirror main@f736927 (manifest-verified), oak/bench-large-mirror main single commit (mount verify pending: 2 startup timeouts >120s under load). 174 tests green.

mrmrs 11d ago

Closed mrmrs-14432c No commits yet

Closed mrmrs-c23b14 No commits yet

Closed mrmrs-da4b8b No commits yet

Closed mrmrs-e34a28 No commits yet

231 files

082519591e18

claims

cloudflare

config

docs

prompts

scenarios

scripts

tests

.gitignore .oakignore AGENTS.md CONTEXT.md COORDINATION.md LEDGER.md README.md

AGENTS.md 106 lines · 4.4 KB

# AGENTS.md

Guidance for AI coding agents working in this benchmark repository.

## This repo uses Oak

Use `oak`, not Git, for version control in this repository.

```bash
oak status
oak diff
oak commit          # local checkpoint only; does not publish
oak desc --file /tmp/oak-branch-desc.txt
oak push --repo oak/benchmarks
```

Do not mutate canonical checkouts such as `/Users/mrmrs/o/oak`,
`/Users/mrmrs/o/benchmarks`, or `/Users/mrmrs/o/oakspace`. Work in an isolated
`worker-mrmrs-*` checkout and report the full worker path with every branch,
commit, and validation result.

## Working On The Harness

- Shared measurement policy (subjects, env, tokens, row contract, command
  semantics, stream adapters, reporting math) lives in `scripts/oakbench/`.
  Change it there, once — never copy a policy into a lane script.
- Read `CONTEXT.md` for vocabulary and `docs/adr/` for decisions before
  proposing structural changes.
- Run the instrument tests before pushing: `python3 -m unittest discover -s tests`.
- Null means unmeasured (ADR-0002): never emit a fabricated zero for a signal
  the runner did not observe.
- Scenario and operation names are immutable identity (ADR-0005): changed
  semantics require new names or a version bump.
- Before running remote-backed lanes, source the repo-local remote config:
  `. config/bench-env.sh`. Do not rely on `~/.zshrc`; non-interactive agent
  shells often do not load it.
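As an illustration of the null-means-unmeasured rule, the sketch below builds a result row in which an unobserved signal serializes as JSON `null` rather than `0`. The helper `make_row`, the `wall_ms` field, the `"rusage"` source label, and the numbers are hypothetical. The real row contract lives in `scripts/oakbench/` and may differ.

```python
import json


def make_row(operation, wall_ms, peak_rss_bytes=None):
    """Build one result row; None marks a signal the runner did not observe."""
    return {
        "operation": operation,
        "wall_ms": wall_ms,
        # None becomes JSON null. A zero here would claim a measured value.
        "peak_rss_bytes": peak_rss_bytes,
        # Hypothetical label for where the value came from, if anywhere.
        "measurement_source": "rusage" if peak_rss_bytes is not None else None,
    }


measured = make_row("status.dirty", 12.4, peak_rss_bytes=18_432_000)
unmeasured = make_row("status.dirty", 12.4)

if __name__ == "__main__":
    print(json.dumps(measured))
    print(json.dumps(unmeasured))
```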

## Finishing A Task

Before handing work back, leave the repo in a reviewable, server-visible
state:

```bash
oak status
python3 -m unittest discover -s tests
oak commit
oak desc --file /tmp/oak-branch-desc.txt
oak push --repo oak/benchmarks
```

`oak commit` is local-only. Use `oak push` after validation so reviewers can
see the branch. Do not merge your own branch unless the user explicitly asks
for a safe-merge/land flow.

## Improving Oak Against These Benchmarks (swarm runbook)

You are probably here to make Oak (`../oak`) faster and cheaper for agents.
The loop:

1. `python3 scripts/opportunities.py` — the ranked attack list generated from
   the latest results. Pick a target from the top; don't re-derive priorities
   from raw JSONL.
2. Edit Oak in `../oak`, build it (`cargo build --release` there).
3. `python3 scripts/devloop.py` — one command, one verdict: PASS or REGRESSED
   for your changeset vs the Oak baseline.
4. Only claim a win that devloop's printed detection limit supports.

Claim discipline (non-negotiable):

- **Never claim a delta below the noise floor.** devloop prints its own
  per-lane detection limits; the default `--runs 2` can only catch large
  changes. Chasing a 3% win? Bump `--runs` until the limit is below the
  effect you claim, or the claim is noise.
- **A cheaper output that loses information is a regression, not a win.**
  The gates enforce this: `information_recall` < baseline or lost
  `pipe_compatible_unified` structure fails the verdict (RECALL /
  PIPE-COMPAT lines). Don't try to win tokens by omitting changed-file
  names or unified-diff hunks.
- **Never aggregate across instruction levels or transports.**
  `remote.net.*` rows (real server, network) are oak-vs-previous-oak trend
  evidence only — they are deliberately named differently from git's
  local-file `remote.*` ops and must never be compared against them.
- **Measurement serializes; building parallelizes.** Lanes take a
  cross-process measurement lock so concurrent agents don't poison each
  other's timings. If your run waits on the lock, that's correct behavior:
  do your editing/building while waiting, and never set `OAK_BENCH_LOCK=off`
  on a shared machine.
- **Run with your own results dir when working in parallel**
  (`--results results/<your-task-slug>`), and leave `results/latest.*` to
  serialized verdict runs.
- **Skips are work items, not noise.** A returncode-77 row names exactly
  what infrastructure would unlock a measurement (see the "Unmeasured"
  section of the `scripts/opportunities.py` report). Unlocking coverage is
  as valuable as improving a number.
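The first two rules can be read as a simple pre-claim check. The sketch below is only a way to reason about them, not how `scripts/devloop.py` computes its verdict: the function `claim_blockers`, its parameters, and the `NOISE` label are assumptions, while the `RECALL` and `PIPE-COMPAT` labels mirror the verdict lines named above.

```python
def claim_blockers(effect_pct, detection_limit_pct, recall, baseline_recall,
                   pipe_compatible, baseline_pipe_compatible):
    """Return the reasons a claimed win should not be reported (empty if none)."""
    blockers = []
    # An effect at or below the printed detection limit is noise.
    if abs(effect_pct) <= detection_limit_pct:
        blockers.append("NOISE")
    # Cheaper output that names fewer changed files is a regression.
    if recall < baseline_recall:
        blockers.append("RECALL")
    # Losing unified-diff structure the baseline had is also a regression.
    if baseline_pipe_compatible and not pipe_compatible:
        blockers.append("PIPE-COMPAT")
    return blockers


if __name__ == "__main__":
    # A 3% effect against an 8% detection limit cannot be claimed.
    print(claim_blockers(3.0, 8.0, 1.0, 1.0, True, True))
```

Under these assumptions, a 3% win only becomes claimable once more runs push the detection limit below 3%.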

## Benchmark Data Hygiene

- Do not commit raw real-agent transcripts, generated result JSONL, run
  artifacts, or temporary workspaces.
- Keep benchmark scenarios, scripts, configs, dashboard schema, and docs in the
  repo.
- Put large/generated benchmark outputs in external storage or ignored local
  directories.
- Treat `--agent-environment minimal` as the default for publishable token
  comparisons; use `local-default` only when intentionally measuring local agent
  customization overhead.
