/oak

Reviews

0 open branches

1 merged · 24h

86 merged · all-time

Open

merge when ready

Nothing here.

Merged

shipped recently

Merged 08251959 Refresh benchmarks/README.md as agent-native VCS marketing grounded in real coverage.

mrmrs 3hr ago

Merged dd17777d Fix branch-fleet benchmark oracle and cleanup failure metadata.

mrmrs 1d ago

Merged 313053a7 Update the benchmark repo's agent instructions with explicit isolated-worker and finish-publish rules. Why: the stale zdgeier-d8eb3e branch tried to add a commit-and-push rule but was too old to land safely; this reapplies the useful AGENTS.md-only guidance on current main without deleting benchmark harness files.

mrmrs 3d ago

Merged 4a284352 Add launch claim specs for live branch-fleet evidence.

mrmrs 3d ago

Merged 28ec9a1e Update benchmark coverage after M7 mount evidence.

mrmrs 3d ago

Merged 0d4334d3 Add public claim gates that rederive launch numbers from raw JSONL.

mrmrs 3d ago

Merged 956532f0 Repair launch proof harness behavior for the new agent-native contract. Agent-state JSON probes now declare minimum Oak 0.98.0 and emit explicit capability skips for older installed binaries instead of failing required-field checks; branch-fleet and sync live lanes preflight commit --push before setup; mount finish passes an absolute desc-file path; and partial-destination mount recovery is reported as an explicit unsupported skip/finding. Validation: python3 -m unittest discover -s tests; py_compile on changed scripts; core compact rerun exits 0; mount_probe rerun exits 0.

mrmrs 4d ago

Merged a95a262a Accelerate branch_fleet live setup with bounded parallel seed and cleanup while preserving one fleet.seed row and the measured classify/plan/apply/sync/oracle workflow. Records worker counts, global failure indices, and cleanup diagnostics so n100 validation is practical without changing the oracle contract.

mrmrs 4d ago

Merged 1d8dcb61 Harden and accelerate branch_fleet live seed in oak/benchmarks. The Oak live seed path now collapses per-branch commit+push into explicit oak commit --push, emits per-branch and conflict-advancer progress, records seed diagnostics/failure metadata, retries transient 502/503/504 seed failures, wraps timeouts as failed seed rows, scopes cleanup to remote-visible disposable branches, and keeps classify/plan/apply/sync/oracle skipped after partial setup.

mrmrs 4d ago
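
The retry and timeout handling described in 1d8dcb61 could be sketched roughly as follows. This is a minimal illustration, not the harness's code: `run_seed_with_retry`, the `seed_once` callable, the attempt count, the backoff, and the `failure`/`detail`/`attempts` fields are all assumptions. Only the 502/503/504 statuses and the `fleet.seed` row name come from the descriptions above.

```python
import time

TRANSIENT_STATUSES = {502, 503, 504}  # statuses named in the changeset


def run_seed_with_retry(seed_once, max_attempts=3, backoff_s=0.5, sleep=time.sleep):
    """Run one branch seed, retrying transient gateway failures.

    `seed_once` is a hypothetical callable returning a dict with
    'returncode' and 'http_status'; it may raise TimeoutError.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            row = dict(seed_once())
        except TimeoutError as exc:
            # Assumption: a timeout becomes a failed seed row and is not retried.
            return {"op": "fleet.seed", "returncode": 1, "failure": "timeout",
                    "detail": str(exc), "attempts": attempt}
        transient = row.get("http_status") in TRANSIENT_STATUSES
        if not transient or attempt == max_attempts:
            row["attempts"] = attempt
            return row
        sleep(backoff_s * attempt)  # linear backoff, an arbitrary choice


# A 503 followed by a success is retried once and then lands.
responses = iter([{"returncode": 1, "http_status": 503},
                  {"returncode": 0, "http_status": 200}])
recovered = run_seed_with_retry(lambda: next(responses), sleep=lambda s: None)


def timed_out():
    raise TimeoutError("seed exceeded its time limit")


# A timeout is wrapped as a failed row instead of crashing the run.
failed = run_seed_with_retry(timed_out)
```

Returning a row for the timeout case keeps the failure visible in the results instead of aborting the campaign, which matches the "failed seed rows" wording.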

Merged a9d9c0c3 Use checkout-free `oak merge <branch>` in branch-fleet apply.

mrmrs 5d ago

Merged 09bf4468 Keep full machine-parse stdout for branch-fleet classification.

mrmrs 5d ago

Merged ec7a90bf Update benchmark semantics for Oak's explicit publish contract.

mrmrs 5d ago

Merged 3c10ba56 Fix 16 benchmark harness correctness bugs

mrmrs 5d ago

Merged 137d0916 Build and harden the propose-mode overnight perf-improvement loop harness for oak/benchmarks.

mrmrs 5d ago

Merged b3c75a34 Benchmark summary transport guard: workflow A/B summaries now suppress elapsed-time deltas when successful workflow.total rows use different remote/workspace transports (for example Git local_file vs Oak network sync workflows), while still showing each subject's raw elapsed ms and preserving token/tool-call deltas. Adds regression coverage so transport-dependent sync rows cannot be reported as misleading Oak-vs-Git speed deltas.

mrmrs 5d ago
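
A minimal sketch of the transport guard from b3c75a34, assuming each summary row carries `transport`, `elapsed_ms`, `tokens`, and `tool_calls` fields. These field names, the `network` label, and the helper `summarize_pair` are hypothetical, and the numbers are made up for illustration.

```python
def summarize_pair(baseline_row, subject_row):
    """Compare two workflow.total rows without mixing transports."""
    same_transport = baseline_row["transport"] == subject_row["transport"]
    return {
        # Raw elapsed values are always shown for each subject.
        "baseline_elapsed_ms": baseline_row["elapsed_ms"],
        "subject_elapsed_ms": subject_row["elapsed_ms"],
        # The delta is suppressed (None) when the transports differ.
        "elapsed_delta_ms": (subject_row["elapsed_ms"] - baseline_row["elapsed_ms"]
                             if same_transport else None),
        # Token and tool-call deltas do not depend on the transport.
        "token_delta": subject_row["tokens"] - baseline_row["tokens"],
        "tool_call_delta": subject_row["tool_calls"] - baseline_row["tool_calls"],
    }


git_row = {"transport": "local_file", "elapsed_ms": 120, "tokens": 900, "tool_calls": 4}
oak_row = {"transport": "network", "elapsed_ms": 850, "tokens": 700, "tool_calls": 3}
mixed = summarize_pair(git_row, oak_row)

oak_local_row = dict(oak_row, transport="local_file")
matched = summarize_pair(git_row, oak_local_row)
```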

Merged 1e6c31fa Expose branch-fleet workflow-only timing and add a read-only Oak field-map helper.

mrmrs 5d ago

Merged 53fed5b6 Make devloop exercise Oak clone/push storm as part of the agent-scale reliability gate.

mrmrs 5d ago

Merged 97eab9b8 Add branch_fleet_nN platform benchmarks for agent branch-fleet workflows, with failure-mode hardening.

mrmrs 6d ago

Merged 9cbec33c Repair benchmark harness preflights, fixture limits, and failure diagnostics

mrmrs 6d ago

Merged ece1c3e8 bench: branch-triage fixture v2 labels and risk metrics

mrmrs 6d ago

Merged 34da23f1 Add result delta report helper

mrmrs 6d ago

Merged 57a43536 Add branch-triage-shape fixture, metrics, and lane tracer bullet

mrmrs 7d ago

Merged f48e0d3b Measure compact Oak agent state JSON

mrmrs 8d ago

Merged 9c6460ae Refine agent-native JSON probe: separate token accounting from JSON validation. run_timed gains opt-in full_output_bytes that captures stdout_full_text alongside the normal admitted (capped) stdout_text. The JSON probe now measures tokens/bytes on the admitted --admitted-output-chars window (consistent with every other op, fixing the prior interim fix that measured JSON probes on the full window) while validating the JSON oracle against the full capture (so a >20k branch review/diff JSON parses whole — the original truncation false-failure stays fixed). Full capture is bounded at 64MB (checked before reading: no OOM); over the bound, the probe is honestly UNMEASURED (json_validation_source null, json_* null, json_validation_unmeasured_reason set, returncode stays 0) rather than a false parse-failure — the final gate is 'json_oracle_passed is False' so unmeasured never trips returncode 1. Rows stamp json_validation_source (full_stdout_capture | admitted_stdout_capture | null). Two regression tests lock it: admitted-truncated-but-fully-validated keeps token accounting on the admitted window; over-cap full capture is unmeasured not failed. 752 tests green. Supersedes the interim 8f793b35 by restoring uniform token accounting.

mrmrs 8d ago
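
The split between token accounting and JSON validation in 9c6460ae can be illustrated with a small sketch. It simplifies in one important way: the real change checks the 64MB bound before reading, while this sketch receives the full output as an in-memory string. The function name `json_probe_row`, the `output_bytes` field, and the reason string are assumptions; `json_validation_source`, `json_oracle_passed`, and `json_validation_unmeasured_reason` are the field names from the description.

```python
import json

FULL_CAPTURE_MAX_BYTES = 64 * 1024 * 1024  # bound stated in the changeset


def json_probe_row(stdout_full, admitted_chars, full_cap_bytes=FULL_CAPTURE_MAX_BYTES):
    """Measure bytes on the admitted window, validate JSON on the full capture."""
    admitted = stdout_full[:admitted_chars]
    row = {
        "output_bytes": len(admitted.encode("utf-8")),  # accounting: admitted window only
        "json_validation_source": None,
        "json_oracle_passed": None,
        "json_validation_unmeasured_reason": None,
        "returncode": 0,
    }
    if len(stdout_full.encode("utf-8")) > full_cap_bytes:
        # Over the bound: unmeasured, not a parse failure (reason text is hypothetical).
        row["json_validation_unmeasured_reason"] = "full_capture_exceeded_bound"
        return row
    try:
        json.loads(stdout_full)
        row["json_oracle_passed"] = True
    except json.JSONDecodeError:
        row["json_oracle_passed"] = False
    row["json_validation_source"] = "full_stdout_capture"
    # Final gate: only an explicit False trips the return code.
    if row["json_oracle_passed"] is False:
        row["returncode"] = 1
    return row


payload = json.dumps({"files": ["f%d" % i for i in range(5000)]})
validated = json_probe_row(payload, admitted_chars=20000)
unmeasured = json_probe_row(payload, admitted_chars=20000, full_cap_bytes=100)
broken = json_probe_row('{"files": [', admitted_chars=20000)
```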

Merged 8f793b35 Fix agent-native JSON probe truncating large oak JSON into false parse-failure rows: the probe captured with the 20k --admitted-output-chars window, so oak branch review/diff --json (~67KB on high-cardinality branches) was truncated mid-document and the oracle recorded an oak parse failure even though oak emitted complete valid JSON (hit both oak_installed and oak_local in standard runs). Now capability JSON is captured whole via INFO_PROBE_MAX_CHARS like the info probe, and if output ever exceeds even that window the probe emits an honest output_exceeded_capture_window skip (rc 77) instead of parsing truncated bytes and blaming the subject. General command runner unchanged (admitted-output cap is its intended byte-measurement policy). Two regression tests: >20k JSON now parses + captures with the full window; truncated output is an honest skip not a parse failure. 750 tests green.

mrmrs 8d ago

Merged ec21b5d3 Harden fake-provider publish gate and conflict oracle validation: publish_gate now rejects any driver=='fake-provider' or branch_triage_provider=='fake' row from publishable inputs independent of profile (enforced guard, not relying on the platform-not-public-core boundary); conflict_resolution_lane validates oracle schema (expected_conflict/conflicted_paths/resolution/exactly-one-content-form/post_merge_check) and emits honest returncode-77 skip rows on malformed oracles instead of crashing; text resolution content pinned as exact UTF-8 bytes with no newline normalization (documented contract + CRLF-preserving test). Regression tests for all three. 748 tests green.

mrmrs 8d ago
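
The fake-provider guard in ec21b5d3 amounts to a row filter applied regardless of profile. A hedged sketch follows, using the two conditions quoted in the description; the helper name `split_publishable` and the sample rows are illustrative assumptions.

```python
def split_publishable(rows):
    """Separate rows that may feed a publishable result from fake-provider rows."""
    accepted, rejected = [], []
    for row in rows:
        is_fake = (row.get("driver") == "fake-provider"
                   or row.get("branch_triage_provider") == "fake")
        (rejected if is_fake else accepted).append(row)
    return accepted, rejected


rows = [
    {"op": "a", "driver": "cli"},
    {"op": "b", "driver": "fake-provider"},
    {"op": "c", "driver": "api", "branch_triage_provider": "fake"},
]
accepted, rejected = split_publishable(rows)
```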

Merged 98107fc8 Close scale-lane benchmark coverage gaps for LFS, netshape TTFD, and monorepo fixture status

mrmrs 8d ago

Merged 691cef97 Wire agent workflow VCS shim sidecar thrash metrics with null reasons

mrmrs 8d ago

Merged b582f31d Add conflict resolution workflow lane

mrmrs 8d ago

Merged 58b666fe Add minimal branch triage platform lane

mrmrs 8d ago

Merged d18c9c59 Harden platform lane MVP semantics rows and comparison keys

mrmrs 8d ago

Merged 16f312fb Verify content-integrity source strength in repro bundles

mrmrs 8d ago

Merged a1c3d8a4 Harden content integrity public-trust gating

mrmrs 8d ago

Merged 6b3c8ffe Aggregate content-attestation payload sources conservatively

mrmrs 8d ago

Merged f33d82ca Add Oak agent-native JSON capability probes

mrmrs 8d ago

Merged eb8b541d Enforce honest cold-cache state in mount lanes

mrmrs 8d ago

Merged d8a5057a Stamp benchmark rows with environment isolation provenance

mrmrs 8d ago

Merged 0e2857c3 Strengthen provenance hash coverage for core lane rows

mrmrs 8d ago

Merged b276e409 Reject negative baseline noise-floor overrides

mrmrs 8d ago

Merged 7e254e81 zdgeier-d8eb3e Zzdgeier 9d ago


Merged e75f001e Add canonical benchmark remote env file for agents

mrmrs 9d ago

Merged 06ab0f78 Large-binary/LFS tuned mode + netshape TTFD op, reviewed and hardened before landing. git_lfs mode: applies_to enforced from config/git_modes.json (single source of truth; small-binary scenarios honestly mode_skipped), mode.setup.lfs_install/lfs_track rows mirroring fsmonitor's setup pattern, git_lfs in publish_gate TUNED_GIT_SUBJECTS (setup failures trip the gate; presence not required since LFS coverage is scenario/host-dependent — REQUIRED_TUNED_GIT_SUBJECTS split pinned by test), track-coverage guard failing setup closed on untracked binary extensions (lfs_track_incomplete:<ext> — an untracked comparator is an unfair comparator), mocked success-path test with a git-lfs PATH shim + live skipUnless smoke. netshape_log_follow_cold (renamed pre-publication per ADR-0005 state-in-name convention): byteproxy TTFD fields verified lock-guarded, non-empty-chunk-only, reset per run; rows carry ttfd_ms/ttfd_source/first_payload_bytes_server_to_client as explicit nulls when unobserved; ttfd_semantics label cold_network_acquire_then_file_history_query on rows. 695 tests green (2 skips: stress-ng, live git-lfs).

mrmrs 10d ago

Merged 81528a03 GitGoodBench importer: accept the real HuggingFace CSV shape. The raw Lite CSV encodes the scenario column as a Python-literal dict string (pandas repr), not JSON, and carries an unnamed leading index column — parse JSON first then ast.literal_eval (literals only, never code), with BOTH parse paths now accepting only dict/list results so a scalar scenario skips as scenario_json_unparseable instead of going ready. Regression tests for the raw HF shape and the scalar guard; prior-art.md policy row updated to reflect the completed Apache-2.0 license review (clean-room no-code-copy rule retained) and the corrected schema note. Verified against the public Lite CSV first row: ready 1, skipped 0. 676 tests green.

mrmrs 10d ago
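
The two-stage scenario parse described in 81528a03 (JSON first, then `ast.literal_eval`, with both paths accepting only dict or list results) might look like the sketch below. The function name `parse_scenario` and its return shape are assumptions; the skip reason `scenario_json_unparseable` is taken from the description.

```python
import ast
import json


def parse_scenario(raw):
    """Return (value, None) for a dict/list scenario, else (None, skip_reason)."""
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(raw)
        except (ValueError, SyntaxError, TypeError):
            # json.JSONDecodeError is a ValueError; literal_eval raises
            # ValueError or SyntaxError on anything that is not a plain literal.
            continue
        if isinstance(value, (dict, list)):
            return value, None
    return None, "scenario_json_unparseable"


from_json = parse_scenario('{"merge_commit": "abc", "files": 2}')
from_repr = parse_scenario("{'merge_commit': 'abc', 'parent': None}")
scalar = parse_scenario("42")
code = parse_scenario("__import__('os').getcwd()")
```

`ast.literal_eval` evaluates only Python literals, so the last call is rejected without executing anything, which is the "literals only, never code" property the changeset relies on.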

Merged 8c56ba02 GitGoodBench: proper source review + citation. prior-art.md gains the full REALM 2025 BibTeX (Lindenbauer/Bogomolov/Zharov, doi:10.18653/v1/2025.realm-1.19) and verified facts replacing secondhand guesses: real 12-column schema with sample_type merge|file_commit_chain, scenario as embedded JSON, HF distribution (900/120/17469 splits, Apache-2.0, 816 repos), upstream harness not runnable (proprietary code removed), their success-rate-only reporting vs our resource-priced gap. Importer modernized to the real schema: canonical sample_type mapping, file_commit_chain expands to interactive_rebase + iterative_commit scenarios, scenario JSON parsed liberally and embedded verbatim as source_scenario, JSONL input support, citation + source_datasets stamped in output (schema_version 2), legacy liberal mapping kept as fallback. 674 tests green.

mrmrs 10d ago

Merged 316e7330 Persist regression_dimensions end-to-end (review fix): comparisons insert binds it as a JSON string, schema_v2 adds the column additively and refreshes dashboard_comparisons to expose it, StubD1 ingest test proves a p90-only regression round-trips as '["p90"]' with is_regression 1; schema.sql + schema_v2.sql verified to apply cleanly against sqlite. 667 tests green + node parity 16/16.

mrmrs 10d ago
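
The round-trip that 316e7330 tests can be reproduced against an in-memory SQLite database as a rough illustration. The real ingest targets the dashboard's own store with a stub, so the table layout here is an assumption: only the `comparisons` table name and the `is_regression` and `regression_dimensions` columns come from the description, and the `metric` column and helper name are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comparisons ("
    "metric TEXT, is_regression INTEGER, regression_dimensions TEXT)"
)


def insert_comparison(conn, metric, breaching_dimensions):
    """Bind the breaching dimensions as a JSON string, not a Python list."""
    conn.execute(
        "INSERT INTO comparisons VALUES (?, ?, ?)",
        (metric, 1 if breaching_dimensions else 0, json.dumps(breaching_dimensions)),
    )


insert_comparison(conn, "status.dirty", ["p90"])
stored = conn.execute(
    "SELECT is_regression, regression_dimensions FROM comparisons"
).fetchone()
decoded = json.loads(stored[1])
```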

Merged eca97060 Fix 8 second-round review findings + 1 sibling: latest.jsonl pointer only advances after row-contract validation (raw file kept for emitter debugging), dashboard is_regression considers p50 OR p90 with regression_dimensions + severity from breaching dimension, re-upload deletes aggregates/comparisons alongside measurements (no stale groups), JS tail honesty mirrors TAIL_MIN_SAMPLES (p95 null under n=20, p99 under n=100, golden-parity asserted both runtimes), all 26 remaining int(row.get(returncode)) call sites across 10 scripts use row_returncode (+ process_returncode sibling in workflow_ab), thrash polling loops emit one in-place-updated event per loop start instead of re-emitting as the window grows, netshape teardown runs after mid-create failure (created flag set before first command, cleanup-then-reraise), netshape exit code computed over subject rows only (self-test row can no longer mask all-skipped). 667 tests green + node parity 15/15.

mrmrs 10d ago
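
The tail-honesty rule mentioned in eca97060 (p95 null under n=20, p99 null under n=100) can be sketched with a nearest-rank percentile. The thresholds and the `TAIL_MIN_SAMPLES` name come from the description; the function bodies and the summary layout are assumptions.

```python
TAIL_MIN_SAMPLES = {"p95": 20, "p99": 100}  # thresholds from the changeset


def nearest_rank(ordered, pct):
    """Nearest-rank percentile on a sorted list; pct is an integer 1..100."""
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def tail_latency_summary(samples):
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        return {"n": 0, "p50": None, "p95": None, "p99": None, "max": None}
    return {
        "n": n,
        "p50": nearest_rank(ordered, 50),
        # Below the minimum sample count the tail percentile would just be
        # the max under another name, so it is reported as None.
        "p95": nearest_rank(ordered, 95) if n >= TAIL_MIN_SAMPLES["p95"] else None,
        "p99": nearest_rank(ordered, 99) if n >= TAIL_MIN_SAMPLES["p99"] else None,
        "max": ordered[-1],
    }


small = tail_latency_summary(range(1, 11))    # n = 10
medium = tail_latency_summary(range(1, 21))   # n = 20
large = tail_latency_summary(range(1, 101))   # n = 100
```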

Merged 61491464 Fix 10 external-review findings in evidence/verdict layers: repro_verify requires full raw_inputs coverage (incomplete repros cap at COMPARABLE), scorecard reports honest n=contributing-rows + n_successful and caps GREEN/RED at AMBER on partial metric coverage (ADR-0008 updated), fixture verify_fixture three-state status (missing/empty dir fails, manifest-null is unverifiable not verified — reviewer repro now exits 1), baseline campaign pins/measures/versions ONE resolved git binary, dashboard ingest handles subject-less platform rows via effectiveSubject (github/driver identity, skips counted never silent), subject identity keyed on subject_id so mixed binaries never collapse (schema_v2 unique index + documented supersession), comparisons computed against every git-kind baseline with explicit baseline_subject_id + separator-safe keys (order-independent), netshape oak path seeds the served repo before measuring with skip cascade on clone failure, non-race check-instant actually posts+settles status on api driver and honestly skips on cli driver (semantic_contract_match gated on executability), clone_push_storm rows record pushes_succeeded/clones_succeeded as observed zeros so starvation is measurable. 652 tests green + node parity 12/12.

mrmrs 10d ago

Merged aa73b9f7 Document infrastructure plan: control-plane (Cloudflare) vs measurement-plane (bare metal) split, hardware/GitHub/dataset/budget requirements keyed to the exact skip gates in the code, 7-step execution order

mrmrs 10d ago

Merged 4c6fad70 Phase 4: subject-kind plugin interface (config/subject_kinds git/oak/jj with parity_report proving measurement-identical extraction from command_semantics — git/oak parity empty; jj pilot with honest capability-gap nulls), near-declarative scenario_spec loader, lane-contract conformance suite (9 lanes × 5 checks with negative tests + meta-test forcing new lanes to register; reality-vs-plan exemptions discovered by grep and documented), dashboard ingest complete (rows→measurements/subjects upsert/aggregates/comparisons as exported pure functions, schema_v2 additive, scorecard/trends/baseline-book/verifications endpoints, HMAC webhook, Python/JS formula parity via shared golden fixture asserted from both runtimes — node 9/9), stats upgrades (MAD modified-Z flag-never-delete with first-run warmup hint, adaptive rep planning, BH-FDR detail, two-stage confirmation) + statistical-methodology.md, adding-a-subject.md, new-lane-template.md. 624 tests green (1 skip: stress-ng absent).

mrmrs 10d ago
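
For the "MAD modified-Z flag-never-delete" item in 4c6fad70, a minimal sketch is shown below. It assumes the conventional modified Z-score (scale constant 0.6745, cutoff 3.5); the description does not state the harness's actual constants, and the function name and row layout are hypothetical.

```python
import statistics


def flag_outliers(samples, threshold=3.5):
    """Annotate every sample with a modified Z-score; never drop any."""
    median = statistics.median(samples)
    mad = statistics.median([abs(x - median) for x in samples])
    annotated = []
    for x in samples:
        # Assumption: with zero MAD (no spread) nothing is flagged.
        z = 0.0 if mad == 0 else 0.6745 * (x - median) / mad
        annotated.append({"value": x, "modified_z": z, "outlier": abs(z) > threshold})
    return annotated


timings_ms = [10, 11, 10, 12, 11, 10, 50]  # made-up samples with one spike
annotated = flag_outliers(timings_ms)
flagged_values = [row["value"] for row in annotated if row["outlier"]]
```

Flagging rather than deleting keeps the sample count honest: downstream reports can choose to exclude flagged rows, but the raw evidence stays in the results.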

Merged ebb35c36 Phase 3: netshape lane (netns/veth/tc-netem profiles with ±15% self-test rows, stdlib smart-HTTP gitserver proven by real clone+push+protocol-v2 tests, TRANSPORT_NETWORK_SHAPED + shaped_comparison_legal, macOS skip honesty), fleet scale (flock-guarded .partial.jsonl crash insurance, merge_train/rebase_storm/mixed_fleet/conflict_storm/clone_push_storm/long_divergence_k modes with queue-wait/attempts-to-land/poll-latency/lost-update accounting, fleet_report saturation knee + fairness + 5xx detection, existing rows byte-identical), stress-ng loadgen (--load-tier on core+workflow, always-timeout/temp-path/seed discipline, bogo-ops verify-only, skip rows when load unapplyable, envwatch boundary samples + environment_suspect on every row) + falsification tests, t/perf external runner (arm's-length GPLv2-clean aggregation, min-of-N sanity band, golden-fixture parser). 551 tests green (1 skip: stress-ng absent).

mrmrs 10d ago

Merged e85f6e6f Phase 2: repro bundles (repro_bundle.py pinning config/scenario/fixture/binary/noise-floor provenance + repro_verify.py with VERIFIED-EXACT/VERIFIED/COMPARABLE/DIVERGENT verdicts, exact machine-independent matching + ratio-CI overlap for latency), bench-conflict-corpus-v1 generator (9 deterministic conflict kinds incl. adjacent-lines clean-merge-broken-build with committed check.py oracle, self-tested per kind), dirty-tree spectrum (status/diff.dirty.p01/.p10/.p50 behind --dirty-spectrum, untimed seeded dirtying with byte-identical restore, flag-off path proven unchanged), VCS PATH shim (flock-safe sidecar JSONL, overhead measured never subtracted, thrash metrics populated: agent_blocked_on_vcs_ms/vcs_share_of_task_wall/thrash_events_count), integration_race_n1/n10 with rate-limit budgeter pacing recorded as race.pacing.injected + check-instant protection variant (harness-posted statuses). 433 tests green.

mrmrs 10d ago

Merged 3242fb8f Platform lane MVP: pr_single_anatomy + poll_until_merged_i10s + capability probe on protection-none, cli and harness-api drivers (api_metrics exact, token fields null by honesty), monotonic settle clock with poll quantization reporting, rate-limit budgeter (pacing recorded never subtracted), platform_semantics.json fairness contract with all six protection variants name-reserved, runner-stamped rows + platform lane row contract, skip-row honesty without credentials (exit 3), lane excluded from devloop. 380 tests green.

mrmrs 10d ago

Merged b2c66f0c Phase 1 remainder: runner discipline (cachectl purge+honest cache_state, envwatch boundary sampling + environment_suspect, runner_calibration micro-suite CLI, runner-discipline.md), token calibration scaffolding (versioned token_calibration.json uncalibrated-v0, oakbench/calibration.py reporting-layer conversion, token_calibration_campaign.py with tiktoken/heuristic methods), devloop --scorecard (informational target-TDD section vs CURRENT book), xl/monorepo fixture registry entries + make_monorepo_fixture.py deterministic fast-import generator, GitGoodBench-Lite importer with mirror-missing skip honesty. 349 tests green.

mrmrs 10d ago

Merged c7f98378 Phase 1 TDD loop: targets.json + Git Baseline Book loader/emitter + scorecard evaluator/CLI + baseline campaign CLI, with review fixes: noise-inflated GREEN/RED bounds per ADR-0008 (no instrument-luck greens), goal margin measured as distance from parity (fixes goal>1), null-returncode rows treated as failures not crashes, bootstrap default unified at 2000, baseline_book CLI auto-stamps runner identity (ADR-0007), shared metric_value/iter_jsonl/number_or_none (dedup), clear goal validation errors. 301 tests green.

mrmrs 10d ago

Merged 27cab567 Phase 0 benchmark foundation: runner identity, fixture registry, shared stats, byteproxy/hyperfine cross-check, thrash helpers, prior-art coverage, and baseline/scorecard ADRs.

mrmrs 10d ago

Merged 983e59ef Fix benchmark readiness accounting and harness gates

mrmrs 10d ago

Merged 77c6fbfd Add public benchmark readiness bundle and stricter publish gates

mrmrs 10d ago

Merged 0653ae77 Fix benchmark accounting and harness correctness bugs

mrmrs 10d ago

Merged 6d6796e0 mrmrs-9e4c0d

mrmrs 10d ago

Merged a015cd02 sync/divergence recovery + mount-vs-clone lanes. sync_* workflows (workflow_ab): push-divergence/pull-upstream/pull-dirty/non-FF-amend with remote_purpose ctx plumbing, per-run disposable branches, workflow skip rows; live-verified both subjects against oak/bench-sync-tmp. Probed contracts encoded in expected returncodes: git push-reject rc1 / pull-unconfigured rc128 / pull --rebase preserves work; oak 0.96 push rc5, suggested pull rc5, pull --force discards commits (redo steps = recovery cost), and oak pull on a dirty tree SILENTLY discards uncommitted edits (rc0, no warning) — verify.local_payload failing is that measurement. error_mentions_recovery_command on failure rows (git amend non-FF error omits --force-with-lease: captured). mount_vs_clone.py acquisition lane: oak_mount vs git full/shallow/blobless/sparse(3-call) on byte-identical mirrors, time-to-first-read + task-scoped hydration delta via diskprobe, skip-row path contract-validated. row_parity normalizes run timestamps + disposable-branch uniquifiers. Mirrors seeded: github oakdotspace/bench-large-mirror main@f736927 (manifest-verified), oak/bench-large-mirror main single commit (mount verify pending: 2 startup timeouts >120s under load). 174 tests green.

mrmrs 10d ago

Merged 4522f806 mrmrs-baf5cf

mrmrs 10d ago

Merged 97594e90 mount-vs-clone live data landed (n=3/variant, SSH remote): git full 28.1s/670MB, shallow 21.1s/574MB, blobless 24.2s/583MB, sparse-task 7.1s/66MB/3calls to first read on the 351MB/28k-file mirror; oak_mount 0/3 — wide-tree 120s mount timeout REPRODUCED ON CLEAN FSKIT SLATE (leaked-mount hypothesis ruled out), P0 product finding for ../oak: mount startup doesn't scale to the repo class the lazy-hydration pitch targets. Coverage doc gates the oak-vs-git acquisition claim until mount comes up; git-only costs citable with network caveat. Raw lane data copied under results/ (ignored).

mrmrs 10d ago

Merged 15f60559 Branch-lifecycle findings: oak close never propagates to the server (verified via fresh-clone round-trips, by-name and as-current); every clone leaves a transient open branch in listings; merge is the only verified server-side branch remover.

mrmrs 10d ago

Merged 2c45d63d Accuracy & insight push, complete: Phase 0 remote plumbing (oakbench/remotes.py, diskprobe.py), agent-fleet lane live for oak (N mounts, time_to_nth_workspace curves, allocated-vs-visible marginal disk, merge phase + integrity; campaign hardening: partial-row flush, timeout-as-error-row), sync/divergence recovery workflows incl. non-FF-amend + sync.recovery_metrics + actionability, task-loop lane (lane 'task-loop'), real-agent adapter fixes (cursor camelCase usage fallback, goldens for claude/codex/cursor), instruction_level_report.py (zero-shot vs cheat-sheet), publish_gate.py (passing on tuned core + agent + task-loop results), plus the other session's mount-vs-clone live data docs, real-agent findings writeup, and COORDINATION.md. 189 tests green at merge.

mrmrs 10d ago

Merged fa5e945c tokens: price the command an agent types, not the harness's binary path

mrmrs 11d ago

Merged 50e81e22 workflow_ab: history lane measures oak's real file-history and pickaxe when the binary has them

mrmrs 11d ago

Merged 54ff4a1b command_semantics: oak core-equivalent diff is 'diff --print' (v2026-06-11.1)

mrmrs 11d ago

Merged b8a66469 bench-diff-contract

mrmrs 11d ago

Merged 10471910 Add prompts/oak-fix-handoff.md: the complete oak-product findings writeup from two days of benchmark instrumentation, shaped to hand to an agent fixing ../oak. Twenty items in four tiers with observed-verbatim evidence and per-item verification commands. P0 error-recovery traps: self-contradictory oak pull divergence message that recommends only the destructive --force path (the safe push-then-fetch path is never mentioned), branch rename can't find the current branch, same-name-collision error gives impossible advice, mount install success delivered as 'error: Server error:' with the wrong extension name (OakFS vs OakFSExtension), empty-repo mount dead-end with no seeding guidance, dirty-mount teardown refusal without printed recovery steps. P1 performance cliffs from first live mount numbers: 1.4-1.7s first-write-into-mount (then 66ms), warm mount no faster than cold ~1.0s, 281ms no-op fetch, push-dominated 1.2s lifecycle iterations with 250-330ms desc roundtrips. P2 output contracts: piped diff 0% unified-compatible (keep the compactness — recall is 1.0 — add structure when not a TTY), status/branch hot-path verbosity vs git, ASCII banner in agent-paid contexts, missing porcelain/name-only/quiet modes. P3: oak must self-report hydration bytes (FSKit allocation reporting makes disk-side measurement impossible; harness already parses 'hydrated/downloaded: N bytes' patterns), no fsck equivalent, history interrogation gaps (show/blame/pickaxe/file-log), oak finish unshipped (benchmark slot ready), CLI-Mount-app version coupling undefined, silent clean-tree commit. Closes with the don't-regress list: 11-16ms in-mount commits, 10x better lock-wait than git under contention, recall 1.0 at 45 bytes/file, one-call snapshots, content-independent mount cost. 
REWRITE both prompts with the CORRECT orientation (previous versions were backwards and broke three live agent runs): agents work IN a clone of oak/oak and run the benchmark harness AS THEY WORK — they are not launched from oak-benchmarks. benchmark-changeset.md: OAK_SRC = the oak clone you are working in (verify Cargo.toml); BENCH = your own sibling clone of oak/benchmarks, created once, run-only, never another agent's checkout; the loop is edit -> cargo build/test -> devloop --skip-build --oak-local-bin with the binary passed EXPLICITLY every time (defaults like ../oak point at nothing in this layout). optimization-orchestrator.md: session workspace contains the orchestrator's own bench clone plus per-worker dirs each holding BOTH an oak/oak clone (edit here) and a bench clone (measure here); orchestrator computes concrete absolute paths at spawn time and injects them into worker prompts — workers never guess paths and never touch a path they didn't receive; orchestrator re-runs every verdict itself from its own bench against the worker's binary before landing.

mrmrs 11d ago

Merged 19185486 Add prompts/optimization-orchestrator.md: the standing prompt for an autonomous optimization orchestrator (karpathy-autoresearch-style loop). It runs a tmux fleet of codex/claude/cursor workers against ../oak: baseline at n=10, generate ranked hypotheses from scripts/opportunities.py, assign file-disjoint experiments, judge each with devloop + both test suites + quality gates, LAND/KILL/ITERATE, append-only lab journal + leaderboard, explore/exploit split, landing protocol (per-win oak commits with lab-note descs, merge only on full green), cleanup duties (leaked mounts), and stop conditions with a final report. Encodes the session-learned worker launch flags and the hard rules: no claims below the printed noise floor, correctness suites must pass, recall/pipe-compat/determinism/integrity losses are regressions regardless of token wins, never compare network vs local-file transports, never tune the harness to flatter a number (harness fixes are separate changesets). Also adds prompts/benchmark-changeset.md: the prompt for an agent that changed ../oak and wants numbers. Key design: git is a CONTROL GROUP, not a moving part — measured once per harness version into results/baselines/git-smoke.jsonl with a staleness sidecar (benchmarks repo hash + host + command_semantics_version), then every experiment runs only oak_installed/oak_local and regression_report.py merges the baseline file with the fresh run to produce git-vs-installed-vs-changeset in one report. Cuts per-experiment measurement time roughly in half and keeps the git anchor statistically stable across a whole research session. Flow verified live: cross-file deltas compute correctly. Includes per-lane commands, the mount --oak-bin caveat, and the runbook claim-discipline rules. 
Orchestrator prompt gains a worker-isolation topology section: per-worker oak CLONES on native disk (each experiment is its own changeset; cargo builds through a FSKit mount are slow, leave mounts dirty, and add mount-layer variance that contaminates verdict attribution), verdicts via devloop --skip-build --oak-local-bin <worker-clone>/target/release/oak, oak-benchmarks checkout shared read-only. Deliberate exception: exactly one explore-lane worker loops INSIDE an oak mount (CARGO_TARGET_DIR outside) to dogfood mount friction into the journal — the swarm-in-mounts product thesis as a recorded experiment, not an unexamined default. FIX: both prompts were hardcoding this machine's absolute paths (/Users/mrmrs/o/oak-benchmarks, /Users/mrmrs/o/oak) — multiple orchestrators launched from different checkouts all converged on ONE checkout, switching its branches under each other and corrupting all their experiments. Prompts are now location-agnostic: BENCH = the repo the agent is launched in (verified by CONTEXT.md + scripts/oakbench presence, stop-and-ask otherwise), OAK_SRC = $OAK_REPO or BENCH/../oak (the harness's own resolution), RESEARCH = session-timestamped sibling dir so concurrent orchestrators never collide on worker clones, and an explicit rule: never assume another checkout's absolute path — crossing into another orchestrator's checkout corrupts both experiments.

mrmrs 11d ago
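
The staleness sidecar for the git baseline described in 19185486 records three things: the benchmarks repo hash, the host, and the command_semantics_version. A check against it might be sketched as below. The key names, the helper `stale_reasons`, and the sample values other than the version string are illustrative assumptions.

```python
STALENESS_KEYS = ("benchmarks_repo_hash", "host", "command_semantics_version")


def stale_reasons(sidecar, current):
    """Return the keys on which the stored baseline no longer matches."""
    return [key for key in STALENESS_KEYS if sidecar.get(key) != current.get(key)]


sidecar = {"benchmarks_repo_hash": "aaaa1111", "host": "bench-host",
           "command_semantics_version": "v2026-06-11.1"}
same_env = dict(sidecar)
new_harness = dict(sidecar, benchmarks_repo_hash="bbbb2222")

fresh = stale_reasons(sidecar, same_env)        # baseline can be reused
stale = stale_reasons(sidecar, new_harness)     # git baseline must be re-measured
```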

Merged cf9dba41 Mount lane live — first full run (2026-06-11). Setup completed: Oak Mount app v0.96.0 auto-installed, OakFS FSKit extension enabled in System Settings, disposable repo oak/oak-benchmarks-tmp seeded (README/AGENTS/docs/ + 64MB assets/big.bin) and merged to main — oak mount requires a non-empty default branch. Result: 54/58 operations measured. Scenario fixes in this changeset: oak mount end refuses dirty/unpushed mounts (verified product policy), so first_use_read_edit and status_diff_commit_inside_mount now commit+push before teardown — both scenarios had produced zero measurements before today (teardown was broken-by-construction), treated as scenario bug fix, not a rename; huge_file_path wired to assets/big.bin unlocking huge_file_partial_read. First-ever mount numbers (one host, n=1, treat as low-n): mount.start ~1.0s regardless of repo content (2-file repo and 64MB repo identical — lazy hydration confirmed time-wise), teardown ~0.5s, time-to-first-useful-work 1.07s, status/diff/commit INSIDE a mount 10-12ms (local speed), push from mount ~850ms, task lifecycle iteration 2.5s first then ~1.2s steady-state, first WRITE into a mount 1.4-1.7s then ~66ms (write-path warmup = top optimization target), warm mount.start not faster than cold (caching opportunity), space.clean 445ms and correctly tears down only clean+pushed mounts, 64MB file: first 4KB read in 825ms without pulling the file. Measurement caveat: FSKit reports allocated_tree_bytes ~= full logical size, so disk-side accounting cannot distinguish virtual from hydrated blocks — true bytes_hydrated needs oak self-reporting. interrupted_recovery's rc=6 refusal (mount into a dirty destination) is the recorded behavior, not a harness failure.

mrmrs 11d ago

Merged cd5acc0f Swarm-readiness infrastructure, built by a three-agent worker swarm (codex, claude, cursor-agent in tmux) under orchestration; all lanes file-disjoint, integrated and verified by the orchestrator. (1) MEASUREMENT LOCK (codex): oakbench/runlock.py — cross-process flock serializing benchmark measurement so concurrent agents don't poison each other's timings; auto-releases on process death, 30s progress notes naming the holder, OAK_BENCH_LOCK=off escape hatch (single-benchmark machines only), graceful degradation without fcntl. Integrated into workflow_ab/parallel_contention/mount_probe (codex) and bench.py (orchestrator; fixtures build OUTSIDE the lock — builds parallelize, measurement serializes). Run metadata records measurement_lock_wait_ms + held/disabled, null never fabricated. Live-verified: a bench run waited 3.4s behind a 4s holder. (2) OPPORTUNITIES SCOREBOARD (claude): scripts/opportunities.py turns latest results into the swarm's ranked attack list — where-oak-loses-to-git scored by pct-gap x op-family frequency weight (status/diff/commit/proc.spawn=10), absolute quality bar (pipe-compat, recall, determinism, ANSI), oak-only trend targets (remote.net.*, mount), unmeasured coverage grouped by skip_reason with unlock instructions, low-n flagging throughout; deterministic output, optional JSON; 13 tests. First report immediately surfaced: status.dirty +16.7% output bytes vs git, pipe-compat 0%, 25 mount measurements gated on the Mount app install. (3) PARALLEL HYGIENE (cursor-agent): unique per-process suffix (pid+token_hex) on disposable-remote branch names so same-second runs never collide on the server; ensure_fixture builds into a temp dir and atomically renames with lost-race fallback and marker-mismatch error; ResultsStore latest.* copies are atomic (temp + os.replace); 4 tests. 
(4) SWARM RUNBOOK (orchestrator): AGENTS.md section 'Improving Oak Against These Benchmarks' — the loop (opportunities -> edit ../oak -> devloop verdict) and claim discipline: no claims below the printed noise floor, recall/pipe-compat losses are regressions, never compare network vs local-file transports, measurement serializes while builds parallelize, per-agent results dirs, skips are work items. 131 tests passing. The swarm is launchable; remaining unlock is the Oak Mount app for v0.95.0 on this host.

mrmrs 11d ago

Merged a5f991c6 Reviewer-gap instruments: close the eight under-measured areas from the external agent review. (1) Memory: peak_rss_bytes on every timed command via os.wait4 rusage capture in oakbench/execution.py (no wrapper process in the timed path; bytes on all platforms; null where wait4 is unavailable, never zero). (2) True token cost: tool-call envelope accounting in oakbench/tokens.py — a call is never just command text; additive estimated_tokens_envelope_* and *_with_envelope fields price the tool-use/tool-result JSON framing (45 emitted / 27 ingested per call), constants validated by the new token_calibration.py --envelope mode against Anthropic/OpenAI templates. Historical token fields are byte-identical (row parity, ADR-0005). This makes git 2-call snapshots vs oak 1-call carry their real price. (3) Tail latency: tail_latency_summary (p50/p95/p99/max) in oakbench/reporting.py with sample honesty — p95 null below n=20, p99 below n=100, because nearest-rank below 1/(1-p) samples is the max relabelled; benchmark_stats.py grew p95/p99/max columns with the same rule. (4) Semantic value of output: *.inforecall probe rows in bench.py score status/diff output against the fixture ground-truth changed set — information_recall, bytes_per_changed_file_named; compact-but-lossy output now shows as recall < 1.0 instead of a token win. New summary section renders it. (5) Pipe compatibility: diff.full.inforecall always runs the core-equivalent full diff and records pipe_compatible_unified + hunk/file-header/binary-notice counts (oakbench/output_semantics.py). FINDING: oak full diff is 0% pipe-compatible vs git 100% — the compat risk is now a tracked metric. Counter-finding: oak recall stays 1.0 at ~45 bytes/file vs git diff.full ~8.4MB on binary scenarios — oak compactness is not lossy, and the large-file delta is now citable. 
(6) Remote/cold cache: remote.push.first / remote.clone.cold / remote.pull.uptodate core-lane ops against a local file:// bare remote (cache_state recorded per row; network jitter excluded by construction); oak emits returncode-77 skip rows until OAK_BENCH_REMOTE wiring lands; --skip-remote for large profiles; skips surface in a new summary section and never fail the run. (7) Mount lifecycle: mount_probe.py handlers for oak.desc, space.clean, and oak.finish (capability-gated via untimed --help probe — skip row on 0.95 binaries, lights up when oak ships finish), plus loop.push_desc (edit/commit/desc/push iterations with per-iteration rows and a .total row) and the task_lifecycle_loop scenario; note the hand-rolled spec parser rejects YAML folded scalars. (8) Compact-output behavior cost: vcs_info_followup_calls_total in the agent lane — a successful read-only VCS command immediately re-asked in the same family (status then status --long) means the first output was insufficient; claude-style streams resolve tool_use_id -> command via a pre-pass map; metric is null when commands are not adapter-visible (ADR-0002). (9) Contention tail: per-commit latency samples and attempts retained raw on worker rows; parallel.total carries commit_latency_tail + retry_depth_histogram (first run: 4 workers, only 8/24 git commits succeeded first attempt, p95 2.6x median). Tests: 90 passing (22 new in tests/test_new_instruments.py covering envelope math, tail honesty, recall/compat oracles, RSS capture, follow-up detection incl. unmeasured gating). Docs: README agent-efficiency + remote + contention sections, benchmark-coverage.md gains 9 evidence rows with public-claim rules, mount-benchmarks.md lifecycle section. Unblocked by config, not code: set OAK_BENCH_REMOTE / OAK_BENCH_MOUNT_REPO at a disposable repo and the oak remote + lifecycle rows start measuring. 
(14) Round-3 review nits: bytes_read_by_agent no longer fakes file reads with transcript output bytes — nullable in schema, null in this lane (the transcript number stays in command_output_bytes); vcs_metrics_from_tools now derives commands_total AND all family sub-counts from one command list through the shared oakbench.classify.vcs_subcommand (sub-counts can never disagree with the total again; new consistency tests); core-lane rows carry measurement_source. 109 tests passing. (15) Disposable remote wired (oak/oak-benchmarks-tmp): mount.yaml points at it with safe_push enabled (it exists to receive benchmark pushes — never point this at a repo whose history matters), and the core lane now measures REAL Oak server remote ops when OAK_BENCH_REMOTE is set — remote.net.push.first / remote.net.clone.cold / remote.net.fetch.uptodate. Distinct names from git's local-file remote.* ops on purpose (ADR-0005): network vs local-file transport are different measurements, so no report can fabricate a git-vs-oak delta between them; the oak rows answer the release-blocking oak-vs-previous-oak trend. Each run pushes a unique bench-<id> branch (untimed setup) because the server rejects same-named branches with unrelated histories — verified by reproducing the agent-task collision. First live numbers (micro, one host): push 446ms, cold clone 694ms, no-op fetch 281ms. Mount lane remains blocked on this machine: oak.space serves no Oak Mount app for v0.95.0 (HTTP 404) — install via 'make macos-app INSTALL=1' in the oak repo or fix the server publication; mount rows record the failure honestly and skip dependents. The disposable repo accumulates bench branches by design; clean it periodically or recreate it.

mrmrs 11d ago

Merged baca0e47 Two commits. (1) Architecture: extract the oakbench measurement core (scripts/oakbench/: subjects, environment, timed execution, token accounting, fixtures, per-lane row contract validated at write time, command-semantics contract as auditable data, stream-adapter seam, reporting math); lanes become thin scenario definitions, verified measurement-identical via row-level diffs; instrument test suite under tests/; CONTEXT.md glossary + ADRs 0001-0005; compare.py deleted. (2) The agent dev loop: scripts/devloop.py — one command, one verdict (PASS/REGRESSED exit code) for 'did this Oak changeset make things worse'; builds ../oak via cargo, measures the host noise floor with an A/A null test plus per-lane embedded null controls from identical-command steps, gates latency above measured noise, exact-metric efficiency gates vs the Oak baseline (ADR-0004), and fails only NEW Git-guardrail breaches (pre-existing gaps tracked, not failing). scripts/row_parity.py codifies ADR-0005 measurement-identity proofs for harness refactors. bench micro profile + per-lane integration tests (core/contention/mock-agent). Real-adapter verification: ran actual claude 2.1.170 and codex-cli 0.139 bugfix runs; found and fixed two reality gaps — claude stream-json requires --verbose (adapter never worked against the live CLI), and codex turn.completed is session-scoped with cache included in input_tokens (adapter now counts round-trips from agent_message items, never double-counts cache, reports peak context as unmeasured). Sanitized real transcripts promoted to tests/fixtures/*_real.jsonl and pinned by tests (68 total). docs/adding-benchmarks.md playbook; README dev-loop section; coverage table updated (cursor adapter marked unverified).

mrmrs 12d ago

Merged 25ec9ce8 Measurement-accuracy overhaul: schema v2 null-honesty contract (unmeasured = null + measurement_source, never fabricated zeros); turn metrics + per-turn timeline parsed from agent streams; instruction-level familiarity lane (zero-shot/cheat-sheet/full-docs); billing-direction cost-weighted tokens; tuned git modes (untracked-cache/split-index/fsmonitor); output determinism + ANSI probes; proc.spawn overhead rows; history_archaeology + vcs_error_recovery workflows; tokens/turns-to-recovery; parallel contention lane (lock wait, throughput, lost updates, fsck integrity); token_calibration tool for char/4 bias; efficiency regression gates on exact metrics. Post-review fixes: claude/codex turn attribution, read-only oracle shapes, prepare steps in agent lane, grep VCS-dir exclusion, snapshot call accounting, worker exception guard, fsmonitor daemon teardown.

mrmrs 12d ago

Merged 21d5470d mrmrs-01bff3

mrmrs 12d ago

Merged ef77578d Initial Oak benchmark harness

mrmrs 12d ago

Closed

closed without merging

Closed zdgeier-d8eb3e Add a 'Finishing a task — commit and push, every time' rule to AGENTS.md so agents commit and push on task completion (after the instrument tests, no merge) instead of leaving local-only checkpoints.

zdgeier 3d ago

Closed mrmrs-1e5214 Repair benchmark harness preflights, fixture limits, and failure diagnostics

mrmrs 6d ago

Closed mrmrs-e0f0b8 Harden the branch-triage benchmark lane's command provenance and scoring gates.

mrmrs 7d ago

Closed mrmrs-8a4e56 Validate JSON probes from full stdout capture

mrmrs 8d ago

Closed mrmrs-05258e 50d00e76

mrmrs 10d ago

Closed bench-phase0-fleet-sync sync/divergence recovery + mount-vs-clone lanes. sync_* workflows (workflow_ab): push-divergence/pull-upstream/pull-dirty/non-FF-amend with remote_purpose ctx plumbing, per-run disposable branches, workflow skip rows; live-verified both subjects against oak/bench-sync-tmp. Probed contracts encoded in expected returncodes: git push-reject rc1 / pull-unconfigured rc128 / pull --rebase preserves work; oak 0.96 push rc5, suggested pull rc5, pull --force discards commits (redo steps = recovery cost), and oak pull on a dirty tree SILENTLY discards uncommitted edits (rc0, no warning) — verify.local_payload failing is that measurement. error_mentions_recovery_command on failure rows (git amend non-FF error omits --force-with-lease: captured). mount_vs_clone.py acquisition lane: oak_mount vs git full/shallow/blobless/sparse(3-call) on byte-identical mirrors, time-to-first-read + task-scoped hydration delta via diskprobe, skip-row path contract-validated. row_parity normalizes run timestamps + disposable-branch uniquifiers. Mirrors seeded: github oakdotspace/bench-large-mirror main@f736927 (manifest-verified), oak/bench-large-mirror main single commit (mount verify pending: 2 startup timeouts >120s under load). 174 tests green.

mrmrs 11d ago

Closed mrmrs-14432c No commits yet

Closed mrmrs-c23b14 No commits yet

Closed mrmrs-da4b8b No commits yet

Closed mrmrs-e34a28 No commits yet

231 files

082519591e18

claims

cloudflare

config

docs

prompts

scenarios

scripts

tests

.gitignore .oakignore AGENTS.md CONTEXT.md COORDINATION.md LEDGER.md README.md

README.md 417 lines · 17.2 KB

# Oak Benchmarks

Oak is version control designed for coding agents: fewer state-management
steps, task-oriented branches and spaces, compact machine-readable state, and
lazy workspaces that let an agent inspect and edit before paying full checkout
cost.

This repository is Oak's evidence system. It compares Oak with Git on work that
agents actually do: status, diff, snapshots, branch/task isolation, recovery
from bad states, real-agent tool loops, hosted integration, contention, lazy
mounts, output bytes, tool calls, token pressure, turn count, and correctness.

The marketing rule is simple: Oak wins where rows prove it. Gaps become
explicit skip rows and roadmap items. Public claims must cite raw JSONL, tuned
Git baselines, runner identity, sample counts, command track, source
provenance, recall/pipe-compatibility checks, and the measured noise floor.

This repo is intentionally separate from the Oak source checkout so agents can
change Oak while benchmark scaffolding stays isolated.

## Subjects

The default subjects are:

- `git`: stock Git on the same machine.
- `oak_installed`: the Oak binary found on `PATH`.
- `oak_local`: an optional Oak binary built from a local Oak source checkout.
- `oak_main`: an optional clean-main Oak binary for exact main-vs-local
  regression checks.

The suite tracks two comparisons:

- **Oak vs Git**: whether Oak's agent-shaped workflow is cheaper, faster, or
  more reliable than Git for the same task.
- **Oak vs previous Oak**: whether a local Oak changeset improved or regressed
  the current Oak baseline.

When an Oak source checkout is available, rows include source metadata from
`oak hash` and `oak status`. Set `OAK_REPO` or pass `--oak-repo` when the Oak
checkout is not a sibling directory named `oak`.
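
The lookup described above can be sketched as follows. This is a minimal illustration rather than the harness's own code: `resolve_oak_repo` is a hypothetical helper, and letting the `--oak-repo` value win over `OAK_REPO` is an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_oak_repo(cli_value: Optional[str], bench_root: Path) -> Optional[Path]:
    """Return the Oak source checkout to read metadata from, or None.

    Hypothetical helper. Assumed precedence: --oak-repo, then OAK_REPO,
    then a sibling directory named `oak` next to the benchmarks checkout.
    """
    explicit = cli_value or os.environ.get("OAK_REPO")
    if explicit:
        return Path(explicit).expanduser()
    sibling = bench_root.parent / "oak"
    # Without a checkout, rows simply carry no source metadata.
    return sibling if sibling.is_dir() else None
```

Returning `None` instead of guessing a path keeps missing source metadata visible as unmeasured rather than silently wrong.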

## Quick Start

```bash
oak clone oak/benchmarks benchmarks
cd benchmarks
python3 scripts/bench.py --profile smoke
python3 scripts/regression_report.py results/latest.jsonl
```

The core harness needs Python 3.9+ and the benchmarked VCS binaries. Other
lanes may need credentials or platform capabilities: Git LFS for `git_lfs`,
Linux `netns`/`tc` privileges for shaped-network runs, GitHub/Oak disposable
repos for hosted/platform lanes, and model CLIs for real-agent campaigns.
Missing capabilities should produce skip rows, not silent omissions.
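
The harness marks skipped rows with returncode 77 and a `skip_reason`. A minimal sketch of a capability probe that emits such a row is shown here; `capability_row` and every field name other than `skip_reason` are illustrative assumptions, not the real row schema.

```python
import shutil
from typing import Optional

SKIP_RETURNCODE = 77  # the harness's "skipped" code


def capability_row(operation: str, required_binary: str) -> Optional[dict]:
    """Return a skip row when a required binary is missing, else None.

    Hypothetical helper; field names other than skip_reason are assumed.
    """
    if shutil.which(required_binary) is not None:
        return None  # capability present: the caller measures normally
    return {
        "operation": operation,
        "returncode": SKIP_RETURNCODE,
        "skip_reason": f"{required_binary} not found on PATH",
        "wall_ms": None,  # unmeasured stays null, never a fabricated zero
    }
```

Because the skip is itself a row, a missing capability shows up in the results as a work item instead of disappearing from the report.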

Before adjudicating or starting a cross-repo campaign, render the read-only
field map from isolated clones:

```bash
python3 scripts/oak_field_map.py \
  --repo /path/to/worker-.../oak \
  --repo /path/to/worker-.../benchmarks
```

## The Dev Loop

For agents developing Oak, `devloop.py` answers one question: is this Oak
changeset better, the same, or worse?

```bash
python3 scripts/devloop.py                  # cargo-builds ../oak, runs core/workflow/contention
python3 scripts/devloop.py --lanes core     # fastest signal
python3 scripts/devloop.py --oak-local-bin ../oak/target/release/oak --skip-build
```

`devloop.py` measures before it judges. An A/A null test sets the host's noise
floor, embedded null controls catch lane noise, and the report prints detection
limits. A delta below the measured floor is not a claim.
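
The idea of an A/A null test can be sketched as below. How `devloop.py` actually derives its floor is not specified here; this sketch assumes the floor is the largest relative difference of medians seen when both arms run the identical binary, and all function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple

Samples = Sequence[float]


def relative_delta(baseline: Samples, candidate: Samples) -> float:
    """Relative difference of medians: (candidate - baseline) / baseline."""
    base = statistics.median(baseline)
    return (statistics.median(candidate) - base) / base


def noise_floor(aa_pairs: Sequence[Tuple[Samples, Samples]]) -> float:
    """Largest |delta| observed when both arms ran the same binary (A/A)."""
    return max(abs(relative_delta(a1, a2)) for a1, a2 in aa_pairs)


def is_claim(delta: float, floor: float) -> bool:
    """A delta at or below the measured floor is noise, not a claim."""
    return abs(delta) > floor
```

With A/A pairs whose medians differ by about 1%, a 0.5% improvement is rejected as noise while a 5% change clears the floor.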

The verdict gates on the Oak-baseline comparison: latency above noise, tool-call
increases, output-byte growth, new failures, information-recall loss,
pipe-compatibility loss, and contention integrity. New Git-guardrail breaches
are called out; pre-existing Oak-vs-Git gaps are tracked without failing a
changeset. Exit 0 means PASS, exit 1 means REGRESSED.
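
A simplified sketch of that gating logic follows. The metric names are hypothetical stand-ins for the real row fields, and the contention-integrity and Git-guardrail checks are left out; only the split between a noise-aware latency gate and exact-metric gates is illustrated.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

# Hypothetical metric names; the real row schema may differ.
MUST_NOT_RISE = ("tool_calls", "output_bytes", "failures")
MUST_NOT_FALL = ("information_recall", "pipe_compatibility")


def regressions(baseline: Row, candidate: Row, noise_floor: float) -> List[str]:
    """List every gate the candidate breaks against the Oak baseline."""
    found = []
    base_ms, cand_ms = baseline.get("wall_ms"), candidate.get("wall_ms")
    # Latency gates only when the slowdown clears the measured noise floor.
    if base_ms and cand_ms is not None:
        if (cand_ms - base_ms) / base_ms > noise_floor:
            found.append("wall_ms above noise")
    for key in MUST_NOT_RISE + MUST_NOT_FALL:
        old, new = baseline.get(key), candidate.get(key)
        if old is None or new is None:
            continue  # unmeasured values are never judged
        worse = new > old if key in MUST_NOT_RISE else new < old
        if worse:
            found.append(f"{key} regressed")
    return found


def exit_code(found: List[str]) -> int:
    """0 means PASS, 1 means REGRESSED."""
    return 1 if found else 0
```

Note that shrinking output while losing recall still fails: the byte gate passes, but the recall gate does not.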

## What The Suite Measures

Oak's pitch is not just "faster commands." The suite measures the full cost an
agent feels:

- Wall-clock latency for core VCS operations.
- Tool calls and terminal calls.
- Agent-emitted and agent-ingested token pressure, including tool-call
  envelope costs.
- Output bytes, truncation, ANSI pollution, determinism, and prompt-cache
  friendliness.
- Information recall and unified-diff pipe compatibility, so compact output
  cannot win by dropping facts.
- Provider-reported tokens, turn counts, recovery cost, and follow-up calls
  for real coding-agent runs.
- Contention, branch/task isolation, merge throughput, and payload integrity.
- Mount/lazy-hydration time to first useful work and disk/network work avoided.
- Hosted integration waits, API/tool round trips, poll quantization, and branch
  triage outcomes.

Fixture generation, subject binary discovery, and directory copying are outside
the timed window.

## Current Coverage

The implemented and partially implemented lanes are:

| Lane | Entry point | Status |
| --- | --- | --- |
| Core VCS | `scripts/bench.py` | Implemented for init, snapshot, status, diff, branch, task snapshot, remote rows, tuned Git modes, RSS, determinism, recall, pipe compatibility, stats, and regression gates. |
| Scripted workflows | `scripts/workflow_ab.py` | Implemented for deterministic bugfix, wide refactor, large asset, history archaeology, error recovery, and sync/divergence recovery workflows. |
| Real agents | `scripts/agent_workflow.py` | Implemented for mock plus local Codex, Claude, and Cursor Agent stream adapters. Use real rows only when the installed CLI stream has campaign evidence. |
| Parallel contention | `scripts/parallel_contention.py` | Implemented for Git shared/workspace modes; Oak workspace-per-task rows require a disposable Oak remote and otherwise skip honestly. |
| Mount/lazy hydration | `scripts/mount_probe.py`, `scripts/mount_vs_clone.py` | Dry probe and partial remote-backed coverage. Real mount timing needs `OAK_BENCH_MOUNT_REPO`; large-mirror Oak-vs-Git acquisition is not publishable yet. |
| Platform lifecycle | `scripts/platform_lifecycle.py` | Implemented for capability probes, hosted integration anatomy, poll-until-merged, race scenarios, and fake-provider branch triage. Real hosted rows need credentials and disposable repos. Excluded from devloop. |
| Netshape | `scripts/netshape_bench.py` | Implemented where Linux network namespaces and `tc` are available. Cross-subject latency claims require the same shaped pipe and server identity. |
| Dashboard/control plane | `cloudflare/` | Worker/D1/R2 scaffold exists; live historical dashboard claims require deployed ingest and archived rows. |

Coverage details and public-claim rules live in
`docs/benchmark-coverage.md`, `docs/publish-checklist.md`, and
`docs/statistical-methodology.md`.

## In Development And Roadmap

The benchmark suite is also a product map for Oak. Current gaps are intentional
work items, not hidden caveats:

- Compact agent-facing commands: Oak has verified core-equivalent `diff --stat`
  and `diff --print` mappings in `config/command_semantics.json`; still needed
  are short/porcelain or JSON status, diff name-only, and quiet commit output.
- Remote-backed Oak rows: Git local-file `remote.*` rows are implemented; Oak
  network rows use `remote.net.*` when a disposable remote is configured. Local
  file and network transports are never latency-comparable.
- Lazy acquisition story: the suite is built to show time-to-first-read and
  bytes-to-first-read for Oak mount versus Git clone variants, but no public
  Oak-vs-Git acquisition claim exists until Oak mount succeeds on the large
  mirror and enough interleaved repetitions are captured.
- Workspace-per-agent fleets: Git worktree contention is implemented. Oak
  workspace-per-task needs real disposable remotes, high-concurrency mount/push
  behavior, branch cleanup, and merge/integration plumbing.
- Full Oak task loop: the roadmap is mount -> edit -> commit -> push -> desc
  -> finish/clean -> remount follow-up, compared with Git's clone/worktree,
  branch, commit, push, cleanup, and follow-up flow.
- Hosted workflow evidence: platform rows need GitHub org assets, PATs or app
  credentials, webhooks, disposable repos, and real Oak/GitHub provider wiring
  before public hosted-branch claims.
- XL and monorepo evidence: generators and registry entries exist, but planned
  or null-manifest fixtures are not performance evidence until consumed by
  active profiles on pinned runners.
- Product dependencies exposed by skip rows include a self-hostable Oak server,
  bulk history import, webhooks, branch TTL/cleanup, high-concurrency mounts and
  pushes, 100k-branch scale, and 20 GB repository support.
- Additional subject families such as Jujutsu are pilot configuration only
  until golden outputs and lane wiring land.

The eventual public roadmap should make each of these measurable by naming a
runner class, a disposable repo or fixture, an operation vocabulary, raw JSONL,
a fair comparator, and a publish gate.

## Profiles

| Profile | Intended use | Shapes |
| --- | --- | --- |
| `micro` | Noise-floor A/A probes, integration tests, fastest regression signal | Tiny text repo only, 3 reps |
| `smoke` | Every local change or push | Tiny text repo, one medium binary, a few large binaries |
| `standard` | Nightly or pre-merge | Many small files, wide dirty tree, 128 MB single file, many 8 MB files |
| `large` | Dedicated runner | 50k small files, 1 GB single file, multi-GB many-large-file repo |

Smoke catches regressions. Public speed claims need pinned hardware, enough
repetitions, randomization, tuned Git rows, and confidence intervals.
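
As an illustration of the confidence-interval requirement, here is a sketch of a percentile bootstrap over the Oak/Git median-latency ratio. The method the suite actually uses is defined in `docs/statistical-methodology.md` and may differ; `ratio_ci`, its parameters, and the sample values are hypothetical.

```python
import random
from statistics import median


def ratio_ci(oak_ms, git_ms, resamples=2000, seed=0):
    """Return (ratio, low, high): median ratio with a 95% percentile-bootstrap interval."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    ratios = []
    for _ in range(resamples):
        oak = [rng.choice(oak_ms) for _ in oak_ms]  # resample with replacement
        git = [rng.choice(git_ms) for _ in git_ms]
        ratios.append(median(oak) / median(git))
    ratios.sort()
    low = ratios[resamples * 25 // 1000]        # 2.5th percentile
    high = ratios[resamples * 975 // 1000 - 1]  # 97.5th percentile
    return median(oak_ms) / median(git_ms), low, high


ratio, low, high = ratio_ci([10.0, 11.0, 12.0, 11.5], [20.0, 22.0, 24.0, 21.0])
print(f"Oak/Git median ratio {ratio:.2f} (95% CI {low:.2f} to {high:.2f})")
```

A ratio taken within one run is unitless, which is what makes it more portable across runner classes than a raw millisecond delta.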

## Methodology Guardrails

The machine-readable row is the durable source of truth. Summaries and
dashboards are derived views.

Key contracts:

- `agent-default` measures what an agent would naturally call today.
- `core-equivalent` measures the closest equivalent semantic output level
  across subjects. Rows with compatibility notes are diagnostic, not proof.
- Null means unmeasured, never zero.
- Return code 77 means skipped with a `skip_reason`; skip rows are coverage
  work items.
- Scenario and operation names are immutable identity. Changed semantics need a
  new name or version.
- Network rows and local-file rows are different measurements and are never
  aggregated together.
- Real-agent rows are never aggregated across instruction levels.
- Latency compares inside one runner class; portable cross-runner claims use
  same-run ratios and confidence intervals.
- Byte/token wins must cite information recall and pipe compatibility from the
  same run.
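
Several of these contracts can be enforced mechanically when summarizing rows. The sketch below is one possible shape, not the suite's reporting code. `skip_reason` and `remote_transport` are field names from this README; `returncode`, `operation`, `subject`, and `wall_ms` are assumed names for illustration.

```python
from collections import defaultdict
from statistics import median

SKIP_RETURNCODE = 77


def summarize_latency(rows):
    """Median latency per (operation, subject, transport), plus the skipped work items."""
    groups = defaultdict(list)
    skipped = []
    for row in rows:
        if row.get("returncode") == SKIP_RETURNCODE:
            skipped.append((row["operation"], row.get("skip_reason")))
            continue
        value = row.get("wall_ms")
        if value is None:
            continue  # null means unmeasured, never zero
        # Transport is part of the key, so network and local-file rows never mix.
        key = (row["operation"], row["subject"], row.get("remote_transport"))
        groups[key].append(value)
    medians = {key: median(values) for key, values in groups.items()}
    return medians, skipped
```

Skip rows are returned rather than dropped so that they stay visible as coverage work items.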

Methodology docs:

- `docs/command-semantics.md`: `agent-default` versus `core-equivalent`.
- `docs/benchmark-coverage.md`: implemented evidence versus planned coverage.
- `docs/publish-checklist.md`: public claim gates.
- `docs/statistical-methodology.md`: repetitions, confidence intervals, and
  overclaiming rules.
- `docs/tuned-git-baselines.md`: credible Git modes beyond stock Git.
- `docs/runner-discipline.md`: runner identity, cache state, environment
  sampling, and calibration.

## What Is Timed

Core VCS rows time operations such as:

- `repo.init`
- `snapshot.initial`
- `status.clean`
- `status.dirty`
- `diff.dirty`
- `snapshot.dirty`
- `branch.create`
- `task.snapshot`
- `remote.push.first`, `remote.clone.cold`, `remote.pull.uptodate` for Git
  local-file remotes
- `remote.net.push.first`, `remote.net.clone.cold`,
  `remote.net.fetch.uptodate` for Oak network remotes when configured

For Git, staging is part of snapshot timing because agents pay that step. For
Oak, `oak commit --no-verify` is the snapshot operation. Tuned Git modes are
set up outside the timed region after `repo.init`.

Remote rows record `remote_transport` and `remote_server`. Same-machine Git
remote rows isolate VCS transfer cost from network jitter; Oak network rows
measure the real Oak server path. Reports must not subtract one from the other
as a latency delta.
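
A report helper can refuse that subtraction outright. This is a sketch under assumptions: `remote_transport` and `remote_server` are the recorded fields named above, while `wall_ms`, the example field values, and the function name are hypothetical.

```python
def remote_latency_delta_ms(row_a, row_b, metric="wall_ms"):
    """Return row_b minus row_a, but only when both rows used the same remote path."""
    for field in ("remote_transport", "remote_server"):
        if row_a.get(field) != row_b.get(field):
            raise ValueError(
                f"{field} differs: {row_a.get(field)!r} vs {row_b.get(field)!r}"
            )
    return row_b[metric] - row_a[metric]
```

Raising instead of returning a number keeps a mixed-transport delta from slipping into a summary table.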

## Agent-Efficiency Metrics

The direct CLI harness records more than wall-clock time:

- `tool_call_count`, `vcs_tool_call_count`, and `terminal_tool_call_count`.
- `estimated_tokens_agent_emitted`, `estimated_tokens_agent_ingested`, and
  `estimated_cost_weighted_tokens`.
- Tool-call envelope fields, because a terminal call includes model-emitted
  tool-use JSON and model-ingested tool-result framing.
- `raw_output_bytes`, admitted output counts, truncation flags, and byte
  counts.
- `proc.spawn` rows for binary startup overhead.
- Output determinism and ANSI-in-pipe probes.
- `peak_rss_bytes` on platforms where child resource usage is observable.
- `status.dirty.inforecall`, `diff.dirty.inforecall`, and
  `diff.full.inforecall` probe rows for bytes saved versus information lost.
- Tail latency summaries with sample-count honesty: p95 requires n>=20 and p99
  requires n>=100.
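
The sample-count rule for tail latency can be sketched as a gated percentile. The thresholds come from the list above; the nearest-rank method and the name `gated_percentile` are assumptions, since the harness's percentile method is not stated here.

```python
MIN_SAMPLES = {95: 20, 99: 100}


def gated_percentile(samples, pct):
    """Nearest-rank percentile, or None when there are too few samples to report it."""
    n = len(samples)
    if n < MIN_SAMPLES[pct]:
        return None  # unmeasured, reported as null rather than a guess
    ordered = sorted(samples)
    rank = (pct * n + 99) // 100  # integer ceiling of pct * n / 100, 1-based
    return ordered[rank - 1]


latencies = list(range(1, 21))  # 20 samples: enough for p95, not for p99
print(gated_percentile(latencies, 95), gated_percentile(latencies, 99))
```

Returning `None` matches the "null means unmeasured, never zero" contract.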

Tool calls are exact for this harness: each terminal command counts once. Git
snapshot operations intentionally count as two calls (`git add .` plus
`git commit`), while Oak snapshots count as one `oak commit`.

Token counts in direct CLI rows are estimates, not provider billing tokens.
Use `scripts/token_calibration.py` before publishing cross-subject token deltas.
End-to-end agent rows should prefer provider totals from
`token_metrics.total_tokens_reported` when adapters expose them.
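
A sketch of that preference order, assuming `token_metrics` is a nested object on the row and that emitted plus ingested estimates approximate a total. Both are assumptions about row shape, and `total_tokens` is a hypothetical helper.

```python
def total_tokens(row):
    """Return (tokens, source), preferring provider totals over harness estimates."""
    reported = (row.get("token_metrics") or {}).get("total_tokens_reported")
    if reported is not None:
        return reported, "provider"
    emitted = row.get("estimated_tokens_agent_emitted")
    ingested = row.get("estimated_tokens_agent_ingested")
    if emitted is None or ingested is None:
        return None, "unmeasured"  # null stays null; it is never coerced to zero
    return emitted + ingested, "estimate"
```

Carrying the source label alongside the number makes it harder to publish an estimate as if it were a billing total.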

## Tuned Git Baselines

Stock Git alone can overstate Oak wins. Tuned Git modes are wired as derived
subjects:

```bash
python3 scripts/bench.py --git-modes untracked_cache,split_index,fsmonitor,lfs
```

Public wide-tree status/diff tables should include at least
`git_untracked_cache` and `git_fsmonitor`. `git_lfs` applies only to configured
binary scenarios and emits explicit skip rows when Git LFS is unavailable or
not applicable.
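
For orientation, the mode names plausibly correspond to standard Git settings. The mapping below is an assumption about what each mode enables, not the harness's actual setup, which is documented in `docs/tuned-git-baselines.md`. The Git config keys and the `git lfs install --local` command are real; `GIT_MODE_SETUP` and `setup_commands` are hypothetical names.

```python
GIT_MODE_SETUP = {
    "untracked_cache": ["git", "config", "core.untrackedCache", "true"],
    "split_index": ["git", "config", "core.splitIndex", "true"],
    "fsmonitor": ["git", "config", "core.fsmonitor", "true"],
    "lfs": ["git", "lfs", "install", "--local"],
}


def setup_commands(git_modes):
    """Expand a comma-separated --git-modes value into per-repo setup commands."""
    return [GIT_MODE_SETUP[mode] for mode in git_modes.split(",") if mode]


for argv in setup_commands("untracked_cache,fsmonitor"):
    print(" ".join(argv))
```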

## Running Lanes

Core:

```bash
python3 scripts/bench.py --profile smoke
python3 scripts/regression_report.py results/latest.jsonl
python3 scripts/benchmark_stats.py results/latest.jsonl
```

Scripted workflows:

```bash
python3 scripts/workflow_ab.py --workflows all --runs 3
```

Real-agent workflow validation:

```bash
python3 scripts/agent_workflow.py --list-agents
python3 scripts/agent_workflow.py --agents mock --subjects git,oak_installed
```

Parallel contention:

```bash
python3 scripts/parallel_contention.py --subjects git,oak_installed \
  --workers 2,8,32 --commits-per-worker 5
```

Mount and acquisition probes:

```bash
. config/bench-env.sh  # remote-backed lanes; safe no-secret defaults
python3 scripts/mount_probe.py
OAK_BENCH_MIRROR_REPO=oak/bench-large-mirror \
  [email protected]:oakdotspace/bench-large-mirror.git \
  python3 scripts/mount_vs_clone.py --reps 5
```

Platform capability or fake-provider branch triage:

```bash
python3 scripts/platform_lifecycle.py --scenario platform_capability_probe
python3 scripts/platform_lifecycle.py \
  --platform github \
  --driver fake-provider \
  --scenario branch_triage_n4
```

Use `--track agent-default` for current CLI/agent UX and
`--track core-equivalent` when making VCS-mechanics claims.

## Results

Each run writes raw JSONL and, when available, a Markdown summary. Core rows use:

- `results/<timestamp>.jsonl`
- `results/<timestamp>.summary.md`
- `results/latest.jsonl`
- `results/latest.summary.md`

Other lanes may write suffixed latest files such as `latest.mount.jsonl` or
`latest.platform.jsonl`, and long campaigns can flush
`<timestamp>.<lane>.partial.jsonl` as crash insurance before final validation.
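
Result files are plain JSONL, so a reader needs nothing beyond the standard library. A minimal sketch follows; `parse_rows` and `read_rows` are hypothetical helpers, not part of the suite.

```python
import json
from pathlib import Path


def parse_rows(lines):
    """Parse JSONL text lines into row dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def read_rows(path):
    """Read one results file such as results/latest.jsonl."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_rows(handle)
```

For example, `read_rows("results/latest.jsonl")` returns the rows from the most recent core run.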

Never commit raw real-agent transcripts, generated result JSONL, temporary
workspaces, or large benchmark artifacts. Archive publishable raw rows outside
the repo and link them from reports.

## Subject Configuration

By default, `git` and `oak_installed` resolve from `PATH`. Optional local-build
subjects are disabled until explicitly requested. Edit `config/subjects.toml`
or pass overrides:

```bash
python3 scripts/bench.py \
  --subjects git,oak_installed,oak_local \
  --oak-local-bin ../oak/target/release/oak \
  --oak-repo ../oak
```

To benchmark exact `main` against local changes, build or place both binaries at
stable paths and enable `oak_main` and `oak_local`, or pass the paths with CLI
flags. The harness does not mutate the Oak source checkout.

## Architecture

Shared measurement policy lives in `scripts/oakbench/`: subject identity,
environment control, timed execution, token accounting and cost weights,
remotes, run locks, row contracts, command semantics, stream adapters,
fixture/config IO, runner identity, and reporting math.

Lane scripts stay thin over that core:

- `bench.py`
- `workflow_ab.py`
- `agent_workflow.py`
- `parallel_contention.py`
- `mount_probe.py`
- `mount_vs_clone.py`
- `platform_lifecycle.py`
- `netshape_bench.py`
- `task_loop.py`

Domain vocabulary is in `CONTEXT.md`; load-bearing decisions are ADRs under
`docs/adr/`.

## Testing The Instruments

A benchmark that measures with untested instruments cannot make accuracy
claims. The suite ships tests for row contracts, adapter streams, token
accounting, command semantics, skip-row coherence, fixture registries,
calibration math, oracles, and reporting:

```bash
python3 -m unittest discover -s tests
```

Run the test suite before pushing harness changes. For measurement-policy
refactors, use `scripts/row_parity.py` to prove the new rows are
measurement-identical before trusting the refactor.

```bash
python3 scripts/row_parity.py old/latest.jsonl new/latest.jsonl
```

## Cloud And CI

Cloudflare is the control and publishing plane, not the latency measurement
plane. Workers, D1, and R2 receive and index raw rows; dedicated runners execute
benchmarks on pinned hardware.

See `docs/cloud.md` and `docs/infrastructure-plan.md` for the recommended
setup: per-push smoke, nightly standard, dedicated large-file/monorepo runners,
Linux shaped-network runners, macOS mount/agent runners, archived raw rows, and
a small dashboard for Oak-vs-Git and Oak-vs-previous-Oak deltas.

334
335	`Each run writes raw JSONL and, when available, a Markdown summary. Core rows use:`
336
337	- `results/<timestamp>.jsonl`
338	- `results/<timestamp>.summary.md`
339	- `results/latest.jsonl`
340	- `results/latest.summary.md`
341
342	Other lanes may write suffixed latest files such as `latest.mount.jsonl` or
343	`latest.platform.jsonl`, and long campaigns can flush
344	`<timestamp>.<lane>.partial.jsonl` as crash insurance before final validation.
345
346	`Never commit raw real-agent transcripts, generated result JSONL, temporary`
347	`workspaces, or large benchmark artifacts. Archive publishable raw rows outside`
348	`the repo and link them from reports.`
349
350	`## Subject Configuration`
351
352	By default, `git` and `oak_installed` resolve from `PATH`. Optional local-build
353	subjects are disabled until explicitly requested. Edit `config/subjects.toml`
354	`or pass overrides:`
355
356	```bash
357	`python3 scripts/bench.py \`
358	`--subjects git,oak_installed,oak_local \`
359	`--oak-local-bin ../oak/target/release/oak \`
360	`--oak-repo ../oak`
361	```
362
363	To benchmark exact `main` against local changes, build or place both binaries at
364	stable paths and enable `oak_main` and `oak_local`, or pass the paths with CLI
365	`flags. The harness does not mutate the Oak source checkout.`
366
367	`## Architecture`
368
369	Shared measurement policy lives in `scripts/oakbench/`: subject identity,
370	`environment control, timed execution, token accounting and cost weights,`
371	`remotes, run locks, row contracts, command semantics, stream adapters,`
372	`fixture/config IO, runner identity, and reporting math.`
373
374	`Lane scripts stay thin over that core:`
375
376	- `bench.py`
377	- `workflow_ab.py`
378	- `agent_workflow.py`
379	- `parallel_contention.py`
380	- `mount_probe.py`
381	- `mount_vs_clone.py`
382	- `platform_lifecycle.py`
383	- `netshape_bench.py`
384	- `task_loop.py`
385
386	Domain vocabulary is in `CONTEXT.md`; load-bearing decisions are ADRs under
387	`docs/adr/`.
388
389	`## Testing The Instruments`
390
391	`A benchmark that measures with untested instruments cannot make accuracy`
392	`claims. The suite ships tests for row contracts, adapter streams, token`
393	`accounting, command semantics, skip-row coherence, fixture registries,`
394	`calibration math, oracles, and reporting:`
395
396	```bash
397	`python3 -m unittest discover -s tests`
398	```
399
400	`Run the test suite before pushing harness changes. For measurement-policy`
401	refactors, use `scripts/row_parity.py` to prove the new rows are
402	`measurement-identical before trusting the refactor.`
403
404	```bash
405	`python3 scripts/row_parity.py old/latest.jsonl new/latest.jsonl`
406	```
407
408	`## Cloud And CI`
409
410	`Cloudflare is the control and publishing plane, not the latency measurement`
411	`plane. Workers, D1, and R2 receive and index raw rows; dedicated runners execute`
412	`benchmarks on pinned hardware.`
413
414	See `docs/cloud.md` and `docs/infrastructure-plan.md` for the recommended
415	`setup: per-push smoke, nightly standard, dedicated large-file/monorepo runners,`
416	`Linux shaped-network runners, macOS mount/agent runners, archived raw rows, and`
417	`a small dashboard for Oak-vs-Git and Oak-vs-previous-Oak deltas.`