Oak Benchmarks
Oak is version control designed for coding agents: fewer state-management steps, task-oriented branches and spaces, compact machine-readable state, and lazy workspaces that let an agent inspect and edit before paying full checkout cost.
This repository is Oak's evidence system. It compares Oak with Git on work that agents actually do: status, diff, snapshots, branch/task isolation, recovery from bad states, real-agent tool loops, hosted integration, contention, lazy mounts, output bytes, tool calls, token pressure, turn count, and correctness.
The marketing rule is simple: Oak wins where rows prove it. Gaps become explicit skip rows and roadmap items. Public claims must cite raw JSONL, tuned Git baselines, runner identity, sample counts, command track, source provenance, recall/pipe-compatibility checks, and the measured noise floor.
This repo is intentionally separate from the Oak source checkout so agents can change Oak while benchmark scaffolding stays isolated.
Subjects
The default subjects are:
- git: stock Git on the same machine.
- oak_installed: the Oak binary found on PATH.
- oak_local: an optional Oak binary built from a local Oak source checkout.
- oak_main: an optional clean-main Oak binary for exact main-vs-local regression checks.
The suite tracks two comparisons:
- Oak vs Git: whether Oak's agent-shaped workflow is cheaper, faster, or more reliable than Git for the same task.
- Oak vs previous Oak: whether a local Oak changeset improved or regressed the current Oak baseline.
When an Oak source checkout is available, rows include source metadata from
oak hash and oak status. Set OAK_REPO or pass --oak-repo when the Oak
checkout is not a sibling directory named oak.
Quick Start
oak clone oak/benchmarks benchmarks
cd benchmarks
python3 scripts/bench.py --profile smoke
python3 scripts/regression_report.py results/latest.jsonl
The core harness needs Python 3.9+ and the benchmarked VCS binaries. Other
lanes may need credentials or platform capabilities: Git LFS for git_lfs,
Linux netns/tc privileges for shaped-network runs, GitHub/Oak disposable
repos for hosted/platform lanes, and model CLIs for real-agent campaigns.
Missing capabilities should produce skip rows, not silent omissions.
Before adjudicating or starting a cross-repo campaign, render the read-only field map from isolated clones:
python3 scripts/oak_field_map.py \
--repo /path/to/worker-.../oak \
--repo /path/to/worker-.../benchmarks
The Dev Loop
For agents developing Oak, devloop.py answers one question: is this Oak
changeset better, same, or worse?
python3 scripts/devloop.py # cargo-builds ../oak, runs core/workflow/contention
python3 scripts/devloop.py --lanes core # fastest signal
python3 scripts/devloop.py --oak-local-bin ../oak/target/release/oak --skip-build
devloop measures before it judges. An A/A null test sets the host's noise floor, embedded null controls catch lane noise, and the report prints detection limits. A delta below the measured floor is not a claim.
The verdict gates on the Oak-baseline comparison: latency above noise, tool-call increases, output-byte growth, new failures, information-recall loss, pipe-compatibility loss, and contention integrity. New Git-guardrail breaches are called out; pre-existing Oak-vs-Git gaps are tracked without failing a changeset. Exit 0 means PASS, exit 1 means REGRESSED.
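The measure-then-judge idea can be sketched as below. This is a minimal illustration, not the logic in devloop.py: the function names, the choice of "largest relative A/A difference" as the floor, and the sample numbers are all assumptions made for the example.

```python
import statistics

def noise_floor(aa_pairs):
    """Relative noise floor from A/A runs (the same binary measured twice).

    Assumed definition: the largest relative difference observed between two
    runs of an identical binary. A real delta must exceed it to count.
    """
    return max(abs(a - b) / min(a, b) for a, b in aa_pairs)

def verdict(baseline_ms, candidate_ms, floor):
    """Classify a candidate against a baseline using median latency."""
    base = statistics.median(baseline_ms)
    cand = statistics.median(candidate_ms)
    delta = (cand - base) / base
    if abs(delta) <= floor:
        return "same"  # below the measured floor: not a claim
    return "worse" if delta > 0 else "better"

# Illustrative numbers only, not measured results.
floor = noise_floor([(10.0, 10.4), (10.2, 10.0), (9.9, 10.1)])
print(verdict([10.0, 10.1, 10.2], [10.2, 10.3, 10.1], floor))  # within the floor
print(verdict([10.0, 10.1, 10.2], [12.0, 12.2, 12.1], floor))  # well above it
```

A process wrapper could then map "worse" to exit code 1 and everything else to 0, matching the PASS/REGRESSED convention described above.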
What The Suite Measures
Oak's pitch is not just "faster commands." The suite measures the full cost an agent feels:
- Wall-clock latency for core VCS operations.
- Tool calls and terminal calls.
- Agent-emitted and agent-ingested token pressure, including tool-call envelope costs.
- Output bytes, truncation, ANSI pollution, determinism, and prompt-cache friendliness.
- Information recall and unified-diff pipe compatibility, so compact output cannot win by dropping facts.
- Provider-reported tokens, turn counts, recovery cost, and follow-up calls for real coding-agent runs.
- Contention, branch/task isolation, merge throughput, and payload integrity.
- Mount/lazy-hydration time to first useful work and disk/network work avoided.
- Hosted integration waits, API/tool round trips, poll quantization, and branch triage outcomes.
Fixture generation, subject binary discovery, and directory copying are outside the timed window.
Current Coverage
The implemented and partially implemented lanes are:
| Lane | Entry point | Status |
|---|---|---|
| Core VCS | scripts/bench.py | Implemented for init, snapshot, status, diff, branch, task snapshot, remote rows, tuned Git modes, RSS, determinism, recall, pipe compatibility, stats, and regression gates. |
| Scripted workflows | scripts/workflow_ab.py | Implemented for deterministic bugfix, wide refactor, large asset, history archaeology, error recovery, and sync/divergence recovery workflows. |
| Real agents | scripts/agent_workflow.py | Implemented for mock plus local Codex, Claude, and Cursor Agent stream adapters. Use real rows only when the installed CLI stream has campaign evidence. |
| Parallel contention | scripts/parallel_contention.py | Implemented for Git shared/workspace modes; Oak workspace-per-task rows require a disposable Oak remote and otherwise skip honestly. |
| Mount/lazy hydration | scripts/mount_probe.py, scripts/mount_vs_clone.py | Dry probe and partial remote-backed coverage. Real mount timing needs OAK_BENCH_MOUNT_REPO; large-mirror Oak-vs-Git acquisition is not publishable yet. |
| Platform lifecycle | scripts/platform_lifecycle.py | Implemented for capability probes, hosted integration anatomy, poll-until-merged, race scenarios, and fake-provider branch triage. Real hosted rows need credentials and disposable repos. Excluded from devloop. |
| Netshape | scripts/netshape_bench.py | Implemented where Linux network namespaces and tc are available. Cross-subject latency claims require the same shaped pipe and server identity. |
| Dashboard/control plane | cloudflare/ | Worker/D1/R2 scaffold exists; live historical dashboard claims require deployed ingest and archived rows. |
Coverage details and public-claim rules live in
docs/benchmark-coverage.md, docs/publish-checklist.md, and
docs/statistical-methodology.md.
In Development And Roadmap
The benchmark suite is also a product map for Oak. Current gaps are intentional work items, not hidden caveats:
- Compact agent-facing commands: Oak has verified core-equivalent diff --stat and diff --print mappings in config/command_semantics.json; still needed are short/porcelain or JSON status, diff name-only, and quiet commit output.
- Remote-backed Oak rows: Git local-file remote.* rows are implemented; Oak network rows use remote.net.* when a disposable remote is configured. Local file and network transports are never latency-comparable.
- Lazy acquisition story: the suite is built to show time-to-first-read and bytes-to-first-read for Oak mount versus Git clone variants, but no public Oak-vs-Git acquisition claim exists until Oak mount succeeds on the large mirror and enough interleaved repetitions are captured.
- Workspace-per-agent fleets: Git worktree contention is implemented. Oak workspace-per-task needs real disposable remotes, high-concurrency mount/push behavior, branch cleanup, and merge/integration plumbing.
- Full Oak task loop: the roadmap is mount -> edit -> commit -> push -> desc -> finish/clean -> remount follow-up, compared with Git's clone/worktree, branch, commit, push, cleanup, and follow-up flow.
- Hosted workflow evidence: platform rows need GitHub org assets, PATs or app credentials, webhooks, disposable repos, and real Oak/GitHub provider wiring before public hosted-branch claims.
- XL and monorepo evidence: generators and registry entries exist, but planned or null-manifest fixtures are not performance evidence until consumed by active profiles on pinned runners.
- Product dependencies exposed by skip rows include a self-hostable Oak server, bulk history import, webhooks, branch TTL/cleanup, high-concurrency mounts and pushes, 100k-branch scale, and 20 GB repository support.
- Additional subject families such as Jujutsu are pilot configuration only until golden outputs and lane wiring land.
The eventual public roadmap should make each of these measurable: a runner class, a disposable repo or fixture, an operation vocabulary, raw JSONL, a fair comparator, and a publish gate.
Profiles
| Profile | Intended use | Shapes |
|---|---|---|
| micro | Noise-floor A/A probes, integration tests, fastest regression signal | Tiny text repo only, 3 reps |
| smoke | Every local change or push | Tiny text repo, one medium binary, a few large binaries |
| standard | Nightly or pre-merge | Many small files, wide dirty tree, 128 MB single file, many 8 MB files |
| large | Dedicated runner | 50k small files, 1 GB single file, multi-GB many-large-file repo |
Smoke catches regressions. Public speed claims need pinned hardware, enough repetitions, randomization, tuned Git rows, and confidence intervals.
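One common way to attach a confidence interval to a same-run Git/Oak ratio is a percentile bootstrap. The sketch below is an assumption about method, not the procedure specified in docs/statistical-methodology.md; the function name, defaults, and sample latencies are hypothetical.

```python
import random
import statistics

def bootstrap_ratio_ci(git_ms, oak_ms, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for median(git) / median(oak).

    Each iteration resamples both latency lists with replacement and
    recomputes the ratio of medians. A ratio above 1 means Oak was faster.
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    ratios = []
    for _ in range(n_boot):
        g = rng.choices(git_ms, k=len(git_ms))
        o = rng.choices(oak_ms, k=len(oak_ms))
        ratios.append(statistics.median(g) / statistics.median(o))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative latencies in milliseconds, not measured results.
git_ms = [20, 22, 18, 21, 24, 19]
oak_ms = [10, 11, 9, 10, 12, 10]
lo, hi = bootstrap_ratio_ci(git_ms, oak_ms)
print(f"95% CI for git/oak ratio: [{lo:.2f}, {hi:.2f}]")
```

An interval that includes 1.0 would not support a speed claim in either direction, however large the point estimate looks.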
Methodology Guardrails
The machine-readable row is the durable source of truth. Summaries and dashboards are derived views.
Key contracts:
- agent-default measures what an agent would naturally call today.
- core-equivalent measures the closest equivalent semantic output level across subjects. Rows with compatibility notes are diagnostic, not proof.
- Null means unmeasured, never zero.
- Returncode 77 means skipped with a skip_reason; skip rows are coverage work items.
- Scenario and operation names are immutable identity. Changed semantics need a new name or version.
- Network rows and local-file rows are different measurements and are never aggregated together.
- Real-agent rows are never aggregated across instruction levels.
- Latency compares inside one runner class; portable cross-runner claims use same-run ratios and confidence intervals.
- Byte/token wins must cite information recall and pipe compatibility from the same run.
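Several of these contracts can be enforced mechanically when rows are summarized. The sketch below is illustrative only: skip_reason and remote_transport appear in this README, but the subject, operation, returncode, and wall_ms field names are assumptions about the row schema, and summarize is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median

SKIP_RETURNCODE = 77  # "skipped", per the contract above

def summarize(rows, metric="wall_ms"):
    """Median of `metric` per (subject, operation, remote_transport).

    - Skip rows are collected separately as coverage work items.
    - None means unmeasured and is dropped, never coerced to 0.
    - remote_transport is part of the key, so network and local-file rows
      cannot be pooled into one number.
    """
    groups = defaultdict(list)
    skips = []
    for row in rows:
        if row.get("returncode") == SKIP_RETURNCODE:
            skips.append((row["subject"], row["operation"], row.get("skip_reason")))
            continue
        value = row.get(metric)
        if value is None:
            continue
        key = (row["subject"], row["operation"], row.get("remote_transport"))
        groups[key].append(value)
    return {key: median(values) for key, values in groups.items()}, skips

# Illustrative rows, not real benchmark output.
rows = [
    {"subject": "git", "operation": "status.dirty", "returncode": 0, "wall_ms": 12.0},
    {"subject": "git", "operation": "status.dirty", "returncode": 0, "wall_ms": 14.0},
    {"subject": "git", "operation": "status.dirty", "returncode": 0, "wall_ms": None},
    {"subject": "oak_installed", "operation": "remote.net.push.first",
     "returncode": 77, "skip_reason": "no disposable remote"},
]
medians, skips = summarize(rows)
print(medians)
print(skips)
```

Keying on the transport rather than filtering afterwards makes accidental aggregation across transports impossible instead of merely discouraged.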
Methodology docs:
- docs/command-semantics.md: agent-default versus core-equivalent.
- docs/benchmark-coverage.md: implemented evidence versus planned coverage.
- docs/publish-checklist.md: public claim gates.
- docs/statistical-methodology.md: repetitions, confidence intervals, and overclaiming rules.
- docs/tuned-git-baselines.md: credible Git modes beyond stock Git.
- docs/runner-discipline.md: runner identity, cache state, environment sampling, and calibration.
What Is Timed
Core VCS rows time operations such as:
- repo.init
- snapshot.initial
- status.clean
- status.dirty
- diff.dirty
- snapshot.dirty
- branch.create
- task.snapshot
- remote.push.first, remote.clone.cold, remote.pull.uptodate for Git local-file remotes
- remote.net.push.first, remote.net.clone.cold, remote.net.fetch.uptodate for Oak network remotes when configured
For Git, staging is part of snapshot timing because agents pay that step. For
Oak, oak commit --no-verify is the snapshot operation. Tuned Git modes are
set up outside the timed region after repo.init.
Remote rows record remote_transport and remote_server. Same-machine Git
remote rows isolate VCS transfer cost from network jitter; Oak network rows
measure the real Oak server path. Reports must not subtract one from the other
as a latency delta.
Agent-Efficiency Metrics
The direct CLI harness records more than wall-clock time:
- tool_call_count, vcs_tool_call_count, and terminal_tool_call_count.
- estimated_tokens_agent_emitted, estimated_tokens_agent_ingested, and estimated_cost_weighted_tokens.
- Tool-call envelope fields, because a terminal call includes model-emitted tool-use JSON and model-ingested tool-result framing.
- raw_output_bytes, admitted output counts, truncation flags, and byte counts.
- proc.spawn rows for binary startup overhead.
- Output determinism and ANSI-in-pipe probes.
- peak_rss_bytes on platforms where child resource usage is observable.
- status.dirty.inforecall, diff.dirty.inforecall, and diff.full.inforecall probe rows for bytes saved versus information lost.
- Tail latency summaries with sample-count honesty: p95 requires n>=20 and p99 requires n>=100.
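The sample-count rule for tail latency can be sketched as a summary function that refuses to report a percentile it cannot support. The function name, the output keys, and the nearest-rank percentile definition are assumptions for this example; the harness may compute percentiles differently.

```python
import math
import statistics

def tail_summary(samples_ms):
    """Report the median always, p95 only when n >= 20, p99 only when n >= 100.

    Unsupported percentiles are None (unmeasured), never 0.
    """
    n = len(samples_ms)
    ordered = sorted(samples_ms)

    def nearest_rank(p):
        # Nearest-rank percentile: the smallest sample with at least p% of
        # all samples at or below it.
        return ordered[max(0, math.ceil(p * n / 100) - 1)]

    return {
        "n": n,
        "p50": statistics.median(ordered),
        "p95": nearest_rank(95) if n >= 20 else None,
        "p99": nearest_rank(99) if n >= 100 else None,
    }

# Illustrative samples only: 20 runs support p95 but not p99.
print(tail_summary(list(range(1, 21))))
```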
Tool calls are exact for this harness: each terminal command counts once. Git
snapshot operations intentionally count as two calls (git add . plus
git commit), while Oak snapshots count as one oak commit.
Token counts in direct CLI rows are estimates, not provider billing tokens.
Use scripts/token_calibration.py before publishing cross-subject token deltas.
End-to-end agent rows should prefer provider totals from
token_metrics.total_tokens_reported when adapters expose them.
Tuned Git Baselines
Stock Git alone can overstate Oak wins. Tuned Git modes are wired as derived subjects:
python3 scripts/bench.py --git-modes untracked_cache,split_index,fsmonitor,lfs
Public wide-tree status/diff tables should include at least
git_untracked_cache and git_fsmonitor. git_lfs applies only to configured
binary scenarios and emits explicit skip rows when Git LFS is unavailable or
not applicable.
Running Lanes
Core:
python3 scripts/bench.py --profile smoke
python3 scripts/regression_report.py results/latest.jsonl
python3 scripts/benchmark_stats.py results/latest.jsonl
Scripted workflows:
python3 scripts/workflow_ab.py --workflows all --runs 3
Real-agent workflow validation:
python3 scripts/agent_workflow.py --list-agents
python3 scripts/agent_workflow.py --agents mock --subjects git,oak_installed
Parallel contention:
python3 scripts/parallel_contention.py --subjects git,oak_installed \
--workers 2,8,32 --commits-per-worker 5
Mount and acquisition probes:
. config/bench-env.sh # remote-backed lanes; safe no-secret defaults
python3 scripts/mount_probe.py
OAK_BENCH_MIRROR_REPO=oak/bench-large-mirror \
[email protected]:oakdotspace/bench-large-mirror.git \
python3 scripts/mount_vs_clone.py --reps 5
Platform capability or fake-provider branch triage:
python3 scripts/platform_lifecycle.py --scenario platform_capability_probe
python3 scripts/platform_lifecycle.py \
--platform github \
--driver fake-provider \
--scenario branch_triage_n4
Use --track agent-default for current CLI/agent UX and
--track core-equivalent when making VCS-mechanics claims.
Results
Each run writes raw JSONL and, when available, a Markdown summary. Core rows use:
- results/<timestamp>.jsonl
- results/<timestamp>.summary.md
- results/latest.jsonl
- results/latest.summary.md
Other lanes may write suffixed latest files such as latest.mount.jsonl or
latest.platform.jsonl, and long campaigns can flush
<timestamp>.<lane>.partial.jsonl as crash insurance before final validation.
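Because the raw rows are JSONL, ad hoc inspection needs only the standard library. The sketch below assumes one JSON object per line and a subject field on each row; both helper names are hypothetical and not part of the harness.

```python
import json
from collections import Counter
from pathlib import Path

def load_rows(path):
    """Parse one JSON object per non-empty line of a JSONL file."""
    rows = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

def rows_per_subject(rows):
    """Count rows per subject; 'subject' is an assumed field name."""
    return Counter(row.get("subject", "<unknown>") for row in rows)

if __name__ == "__main__":
    latest = Path("results/latest.jsonl")
    if latest.exists():  # only present after a benchmark run
        print(rows_per_subject(load_rows(latest)))
```

The same loader works for suffixed files such as latest.mount.jsonl, since only the path changes.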
Never commit raw real-agent transcripts, generated result JSONL, temporary workspaces, or large benchmark artifacts. Archive publishable raw rows outside the repo and link them from reports.
Subject Configuration
By default, git and oak_installed resolve from PATH. Optional local-build
subjects are disabled until explicitly requested. Edit config/subjects.toml
or pass overrides:
python3 scripts/bench.py \
--subjects git,oak_installed,oak_local \
--oak-local-bin ../oak/target/release/oak \
--oak-repo ../oak
To benchmark exact main against local changes, build or place both binaries at
stable paths and enable oak_main and oak_local, or pass the paths with CLI
flags. The harness does not mutate the Oak source checkout.
Architecture
Shared measurement policy lives in scripts/oakbench/: subject identity,
environment control, timed execution, token accounting and cost weights,
remotes, run locks, row contracts, command semantics, stream adapters,
fixture/config IO, runner identity, and reporting math.
Lane scripts stay thin over that core:
- bench.py
- workflow_ab.py
- agent_workflow.py
- parallel_contention.py
- mount_probe.py
- mount_vs_clone.py
- platform_lifecycle.py
- netshape_bench.py
- task_loop.py
Domain vocabulary is in CONTEXT.md; load-bearing decisions are ADRs under
docs/adr/.
Testing The Instruments
A benchmark that measures with untested instruments cannot make accuracy claims. The suite ships tests for row contracts, adapter streams, token accounting, command semantics, skip-row coherence, fixture registries, calibration math, oracles, and reporting:
python3 -m unittest discover -s tests
Run the test suite before pushing harness changes. For measurement-policy
refactors, use scripts/row_parity.py to prove the new rows are
measurement-identical before trusting the refactor.
python3 scripts/row_parity.py old/latest.jsonl new/latest.jsonl
Cloud And CI
Cloudflare is the control and publishing plane, not the latency measurement plane. Workers, D1, and R2 receive and index raw rows; dedicated runners execute benchmarks on pinned hardware.
See docs/cloud.md and docs/infrastructure-plan.md for the recommended
setup: per-push smoke, nightly standard, dedicated large-file/monorepo runners,
Linux shaped-network runners, macOS mount/agent runners, archived raw rows, and
a small dashboard for Oak-vs-Git and Oak-vs-previous-Oak deltas.