Oak Benchmarks

Oak is version control designed for coding agents: fewer state-management steps, task-oriented branches and spaces, compact machine-readable state, and lazy workspaces that let an agent inspect and edit before paying full checkout cost.

This repository is Oak's evidence system. It compares Oak with Git on work that agents actually do: status, diff, snapshots, branch/task isolation, recovery from bad states, real-agent tool loops, hosted integration, contention, lazy mounts, output bytes, tool calls, token pressure, turn count, and correctness.

The marketing rule is simple: Oak wins where rows prove it. Gaps become explicit skip rows and roadmap items. Public claims must cite raw JSONL, tuned Git baselines, runner identity, sample counts, command track, source provenance, recall/pipe-compatibility checks, and the measured noise floor.

This repo is intentionally separate from the Oak source checkout so agents can change Oak while benchmark scaffolding stays isolated.

Subjects

The default subjects are:

  • git: stock Git on the same machine.
  • oak_installed: the Oak binary found on PATH.
  • oak_local: an optional Oak binary built from a local Oak source checkout.
  • oak_main: an optional clean-main Oak binary for exact main-vs-local regression checks.

The suite tracks two comparisons:

  • Oak vs Git: whether Oak's agent-shaped workflow is cheaper, faster, or more reliable than Git for the same task.
  • Oak vs previous Oak: whether a local Oak changeset improved or regressed the current Oak baseline.

When an Oak source checkout is available, rows include source metadata from oak hash and oak status. Set OAK_REPO or pass --oak-repo when the Oak checkout is not a sibling directory named oak.

Quick Start

oak clone oak/benchmarks benchmarks
cd benchmarks
python3 scripts/bench.py --profile smoke
python3 scripts/regression_report.py results/latest.jsonl

The core harness needs Python 3.9+ and the benchmarked VCS binaries. Other lanes may need credentials or platform capabilities: Git LFS for git_lfs, Linux netns/tc privileges for shaped-network runs, GitHub/Oak disposable repos for hosted/platform lanes, and model CLIs for real-agent campaigns. Missing capabilities should produce skip rows, not silent omissions.

Before adjudicating or starting a cross-repo campaign, render the read-only field map from isolated clones:

python3 scripts/oak_field_map.py \
  --repo /path/to/worker-.../oak \
  --repo /path/to/worker-.../benchmarks

The Dev Loop

For agents developing Oak, devloop.py answers one question: is this Oak changeset better, same, or worse?

python3 scripts/devloop.py                  # cargo-builds ../oak, runs core/workflow/contention
python3 scripts/devloop.py --lanes core     # fastest signal
python3 scripts/devloop.py --oak-local-bin ../oak/target/release/oak --skip-build

devloop measures before it judges. An A/A null test sets the host's noise floor, embedded null controls catch lane noise, and the report prints detection limits. A delta below the measured floor is not a claim.
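The measure-before-judging idea can be sketched as follows. This is a minimal illustration, not devloop's actual code: the function names `noise_floor` and `judge`, the use of medians, and the relative-delta definition are all assumptions made for the example.

```python
import statistics


def noise_floor(a_samples, b_samples):
    """Relative A/A noise floor: two runs of the SAME binary, compared by median.

    Hypothetical definition for illustration; the real harness may use a
    different statistic.
    """
    med_a = statistics.median(a_samples)
    med_b = statistics.median(b_samples)
    return abs(med_a - med_b) / med_a


def judge(baseline, candidate, floor):
    """Classify a latency delta as better, same, or worse against the floor."""
    med_base = statistics.median(baseline)
    delta = (statistics.median(candidate) - med_base) / med_base
    if abs(delta) <= floor:
        return "same"  # below the measured floor: not a claim
    return "worse" if delta > 0 else "better"
```

With a floor of 1%, a candidate that is 0.5% slower is reported as "same", because the host cannot distinguish that delta from its own noise.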

The verdict gates on the Oak-baseline comparison: latency above noise, tool-call increases, output-byte growth, new failures, information-recall loss, pipe-compatibility loss, and contention integrity. New Git-guardrail breaches are called out; pre-existing Oak-vs-Git gaps are tracked without failing a changeset. Exit 0 means PASS, exit 1 means REGRESSED.
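A wrapper script or CI step can rely on the exit-code contract alone. The sketch below assumes only what is stated above (exit 0 is PASS, exit 1 is REGRESSED); the helper names and the handling of any other exit code are hypothetical.

```python
import subprocess
import sys

# Exit-code contract as documented above; other codes are not specified here.
VERDICTS = {0: "PASS", 1: "REGRESSED"}


def verdict_from_exit(returncode):
    """Map a devloop exit code to a verdict label."""
    return VERDICTS.get(returncode, f"UNKNOWN({returncode})")


def run_devloop(extra_args=()):
    """Run scripts/devloop.py from the repo root and return its verdict."""
    cmd = [sys.executable, "scripts/devloop.py", *extra_args]
    completed = subprocess.run(cmd)
    return verdict_from_exit(completed.returncode)
```

For example, `run_devloop(["--lanes", "core"])` would run the fastest lane and return the verdict string.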

What The Suite Measures

Oak's pitch is not just "a faster command." The suite measures the full cost an agent feels:

  • Wall-clock latency for core VCS operations.
  • Tool calls and terminal calls.
  • Agent-emitted and agent-ingested token pressure, including tool-call envelope costs.
  • Output bytes, truncation, ANSI pollution, determinism, and prompt-cache friendliness.
  • Information recall and unified-diff pipe compatibility, so compact output cannot win by dropping facts.
  • Provider-reported tokens, turn counts, recovery cost, and follow-up calls for real coding-agent runs.
  • Contention, branch/task isolation, merge throughput, and payload integrity.
  • Mount/lazy-hydration time to first useful work and disk/network work avoided.
  • Hosted integration waits, API/tool round trips, poll quantization, and branch triage outcomes.

Fixture generation, subject binary discovery, and directory copying are outside the timed window.

Current Coverage

The implemented and partially implemented lanes are:

| Lane | Entry point | Status |
| --- | --- | --- |
| Core VCS | scripts/bench.py | Implemented for init, snapshot, status, diff, branch, task snapshot, remote rows, tuned Git modes, RSS, determinism, recall, pipe compatibility, stats, and regression gates. |
| Scripted workflows | scripts/workflow_ab.py | Implemented for deterministic bugfix, wide refactor, large asset, history archaeology, error recovery, and sync/divergence recovery workflows. |
| Real agents | scripts/agent_workflow.py | Implemented for mock plus local Codex, Claude, and Cursor Agent stream adapters. Use real rows only when the installed CLI stream has campaign evidence. |
| Parallel contention | scripts/parallel_contention.py | Implemented for Git shared/workspace modes; Oak workspace-per-task rows require a disposable Oak remote and otherwise skip honestly. |
| Mount/lazy hydration | scripts/mount_probe.py, scripts/mount_vs_clone.py | Dry probe and partial remote-backed coverage. Real mount timing needs OAK_BENCH_MOUNT_REPO; large-mirror Oak-vs-Git acquisition is not publishable yet. |
| Platform lifecycle | scripts/platform_lifecycle.py | Implemented for capability probes, hosted integration anatomy, poll-until-merged, race scenarios, and fake-provider branch triage. Real hosted rows need credentials and disposable repos. Excluded from devloop. |
| Netshape | scripts/netshape_bench.py | Implemented where Linux network namespaces and tc are available. Cross-subject latency claims require the same shaped pipe and server identity. |
| Dashboard/control plane | cloudflare/ | Worker/D1/R2 scaffold exists; live historical dashboard claims require deployed ingest and archived rows. |

Coverage details and public-claim rules live in docs/benchmark-coverage.md, docs/publish-checklist.md, and docs/statistical-methodology.md.

In Development And Roadmap

The benchmark suite is also a product map for Oak. Current gaps are intentional work items, not hidden caveats:

  • Compact agent-facing commands: Oak has verified core-equivalent diff --stat and diff --print mappings in config/command_semantics.json; still needed are short/porcelain or JSON status, diff name-only, and quiet commit output.
  • Remote-backed Oak rows: Git local-file remote.* rows are implemented; Oak network rows use remote.net.* when a disposable remote is configured. Local file and network transports are never latency-comparable.
  • Lazy acquisition story: the suite is built to show time-to-first-read and bytes-to-first-read for Oak mount versus Git clone variants, but no public Oak-vs-Git acquisition claim exists until Oak mount succeeds on the large mirror and enough interleaved repetitions are captured.
  • Workspace-per-agent fleets: Git worktree contention is implemented. Oak workspace-per-task needs real disposable remotes, high-concurrency mount/push behavior, branch cleanup, and merge/integration plumbing.
  • Full Oak task loop: the roadmap is mount -> edit -> commit -> push -> desc -> finish/clean -> remount follow-up, compared with Git's clone/worktree, branch, commit, push, cleanup, and follow-up flow.
  • Hosted workflow evidence: platform rows need GitHub org assets, PATs or app credentials, webhooks, disposable repos, and real Oak/GitHub provider wiring before public hosted-branch claims.
  • XL and monorepo evidence: generators and registry entries exist, but planned or null-manifest fixtures are not performance evidence until consumed by active profiles on pinned runners.
  • Product dependencies exposed by skip rows include a self-hostable Oak server, bulk history import, webhooks, branch TTL/cleanup, high-concurrency mounts and pushes, 100k-branch scale, and 20 GB repository support.
  • Additional subject families such as Jujutsu are pilot configuration only until golden outputs and lane wiring land.

The eventual public roadmap should make each of these measurable by naming a runner class, a disposable repo or fixture, an operation vocabulary, raw JSONL, a fair comparator, and a publish gate.

Profiles

| Profile | Intended use | Shapes |
| --- | --- | --- |
| micro | Noise-floor A/A probes, integration tests, fastest regression signal | Tiny text repo only, 3 reps |
| smoke | Every local change or push | Tiny text repo, one medium binary, a few large binaries |
| standard | Nightly or pre-merge | Many small files, wide dirty tree, 128 MB single file, many 8 MB files |
| large | Dedicated runner | 50k small files, 1 GB single file, multi-GB many-large-file repo |

Smoke catches regressions. Public speed claims need pinned hardware, enough repetitions, randomization, tuned Git rows, and confidence intervals.

Methodology Guardrails

The machine-readable row is the durable source of truth. Summaries and dashboards are derived views.

Key contracts:

  • agent-default measures what an agent would naturally call today.
  • core-equivalent measures the closest semantically equivalent output level across subjects. Rows with compatibility notes are diagnostic, not proof.
  • Null means unmeasured, never zero.
  • Returncode 77 means skipped with a skip_reason; skip rows are coverage work items.
  • Scenario and operation names are immutable identity. Changed semantics need a new name or version.
  • Network rows and local-file rows are different measurements and are never aggregated together.
  • Real-agent rows are never aggregated across instruction levels.
  • Latency compares inside one runner class; portable cross-runner claims use same-run ratios and confidence intervals.
  • Byte/token wins must cite information recall and pipe compatibility from the same run.
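Two of the contracts above (null means unmeasured, returncode 77 means skipped) affect how any consumer should read rows. The following is a minimal sketch of a reader that honors them. The field names `returncode`, `skip_reason`, and `peak_rss_bytes` appear in this README; the function name and the shape of the returned summary are assumptions for illustration.

```python
import json

SKIP_RETURNCODE = 77  # skipped rows carry a skip_reason


def summarize(lines, metric="peak_rss_bytes"):
    """Split JSONL rows into measured values, unmeasured rows, and skips."""
    measured, unmeasured, skipped = [], 0, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("returncode") == SKIP_RETURNCODE:
            skipped.append(row.get("skip_reason"))
            continue
        value = row.get(metric)
        if value is None:
            unmeasured += 1  # null is "unmeasured", never coerced to zero
        else:
            measured.append(value)
    return {"measured": measured, "unmeasured": unmeasured, "skipped": skipped}
```

Keeping unmeasured rows out of the value list prevents a missing measurement from dragging an average toward zero, and keeping skip reasons makes coverage gaps visible in any derived report.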

Methodology docs:

  • docs/command-semantics.md: agent-default versus core-equivalent.
  • docs/benchmark-coverage.md: implemented evidence versus planned coverage.
  • docs/publish-checklist.md: public claim gates.
  • docs/statistical-methodology.md: repetitions, confidence intervals, and overclaiming rules.
  • docs/tuned-git-baselines.md: credible Git modes beyond stock Git.
  • docs/runner-discipline.md: runner identity, cache state, environment sampling, and calibration.

What Is Timed

Core VCS rows time operations such as:

  • repo.init
  • snapshot.initial
  • status.clean
  • status.dirty
  • diff.dirty
  • snapshot.dirty
  • branch.create
  • task.snapshot
  • remote.push.first, remote.clone.cold, remote.pull.uptodate for Git local-file remotes
  • remote.net.push.first, remote.net.clone.cold, remote.net.fetch.uptodate for Oak network remotes when configured

For Git, staging is part of snapshot timing because agents pay that step. For Oak, oak commit --no-verify is the snapshot operation. Tuned Git modes are set up outside the timed region after repo.init.

Remote rows record remote_transport and remote_server. Same-machine Git remote rows isolate VCS transfer cost from network jitter; Oak network rows measure the real Oak server path. Reports must not subtract one from the other as a latency delta.
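A report generator can enforce the no-subtraction rule mechanically. The sketch below groups rows by the recorded `remote_transport` and `remote_server` fields and refuses a cross-transport delta. The metric name `wall_ms`, the function names, and the example transport values are hypothetical.

```python
from collections import defaultdict
import statistics


def medians_by_remote(rows, metric="wall_ms"):
    """Median of `metric` per (remote_transport, remote_server) group."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["remote_transport"], row["remote_server"])
        groups[key].append(row[metric])
    return {key: statistics.median(values) for key, values in groups.items()}


def latency_delta(row_a, row_b, metric="wall_ms"):
    """Return row_b minus row_a, refusing to compare across transports."""
    if row_a["remote_transport"] != row_b["remote_transport"]:
        raise ValueError("local-file and network rows are not latency-comparable")
    return row_b[metric] - row_a[metric]
```

Raising instead of returning a number makes an invalid comparison fail loudly rather than appear in a table.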

Agent-Efficiency Metrics

The direct CLI harness records more than wall-clock time:

  • tool_call_count, vcs_tool_call_count, and terminal_tool_call_count.
  • estimated_tokens_agent_emitted, estimated_tokens_agent_ingested, and estimated_cost_weighted_tokens.
  • Tool-call envelope fields, because a terminal call includes model-emitted tool-use JSON and model-ingested tool-result framing.
  • raw_output_bytes, admitted output counts, truncation flags, and byte counts.
  • proc.spawn rows for binary startup overhead.
  • Output determinism and ANSI-in-pipe probes.
  • peak_rss_bytes on platforms where child resource usage is observable.
  • status.dirty.inforecall, diff.dirty.inforecall, and diff.full.inforecall probe rows for bytes saved versus information lost.
  • Tail latency summaries with sample-count honesty: p95 requires n>=20 and p99 requires n>=100.
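The sample-count rule for tail latency can be sketched like this. The thresholds (p95 needs n>=20, p99 needs n>=100) come from the list above; the nearest-rank percentile method and the function name `tail_summary` are assumptions, and the real harness may compute percentiles differently.

```python
def tail_summary(samples):
    """Summarize latency samples, withholding tails the sample count cannot support."""
    ordered = sorted(samples)
    n = len(ordered)

    def nearest_rank(pct):
        # Nearest-rank percentile with integer math: rank = ceil(pct * n / 100).
        rank = max(1, (pct * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "n": n,
        "p50": nearest_rank(50) if n >= 1 else None,
        "p95": nearest_rank(95) if n >= 20 else None,   # None means unmeasured
        "p99": nearest_rank(99) if n >= 100 else None,
    }
```

Returning `None` rather than a number for an under-sampled percentile matches the contract that null means unmeasured, never zero.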

Tool calls are exact for this harness: each terminal command counts once. Git snapshot operations intentionally count as two calls (git add . plus git commit), while Oak snapshots count as one oak commit.
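As an illustration of that counting rule, the snippet below models a snapshot as a list of terminal commands and counts one tool call per command. The `SNAPSHOT_COMMANDS` table and the function name are hypothetical, and commit-message flags are omitted for brevity.

```python
# One inner list per terminal command the agent would issue (illustrative only).
SNAPSHOT_COMMANDS = {
    "git": [["git", "add", "."], ["git", "commit"]],
    "oak": [["oak", "commit", "--no-verify"]],
}


def tool_call_count(commands):
    """Each terminal command counts exactly once."""
    return len(commands)
```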

Token counts in direct CLI rows are estimates, not provider billing tokens. Use scripts/token_calibration.py before publishing cross-subject token deltas. End-to-end agent rows should prefer provider totals from token_metrics.total_tokens_reported when adapters expose them.

Tuned Git Baselines

Stock Git alone can overstate Oak wins. Tuned Git modes are wired as derived subjects:

python3 scripts/bench.py --git-modes untracked_cache,split_index,fsmonitor,lfs

Public wide-tree status/diff tables should include at least git_untracked_cache and git_fsmonitor. git_lfs applies only to configured binary scenarios and emits explicit skip rows when Git LFS is unavailable or not applicable.

Running Lanes

Core:

python3 scripts/bench.py --profile smoke
python3 scripts/regression_report.py results/latest.jsonl
python3 scripts/benchmark_stats.py results/latest.jsonl

Scripted workflows:

python3 scripts/workflow_ab.py --workflows all --runs 3

Real-agent workflow validation:

python3 scripts/agent_workflow.py --list-agents
python3 scripts/agent_workflow.py --agents mock --subjects git,oak_installed

Parallel contention:

python3 scripts/parallel_contention.py --subjects git,oak_installed \
  --workers 2,8,32 --commits-per-worker 5

Mount and acquisition probes:

. config/bench-env.sh  # remote-backed lanes; safe no-secret defaults
python3 scripts/mount_probe.py
OAK_BENCH_MIRROR_REPO=oak/bench-large-mirror \
  [email protected]:oakdotspace/bench-large-mirror.git \
  python3 scripts/mount_vs_clone.py --reps 5

Platform capability or fake-provider branch triage:

python3 scripts/platform_lifecycle.py --scenario platform_capability_probe
python3 scripts/platform_lifecycle.py \
  --platform github \
  --driver fake-provider \
  --scenario branch_triage_n4

Use --track agent-default for current CLI/agent UX and --track core-equivalent when making VCS-mechanics claims.

Results

Each run writes raw JSONL and, when available, a Markdown summary. Core rows use:

  • results/<timestamp>.jsonl
  • results/<timestamp>.summary.md
  • results/latest.jsonl
  • results/latest.summary.md

Other lanes may write suffixed latest files such as latest.mount.jsonl or latest.platform.jsonl, and long campaigns can flush <timestamp>.<lane>.partial.jsonl as crash insurance before final validation.
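A consumer of these files might load them as sketched below. This assumes one JSON object per line, and it further assumes that a partial file written before a crash may end in a truncated line; neither the helper names nor that tolerance behavior is confirmed by the harness.

```python
import json
from pathlib import Path


def parse_rows(lines, tolerate_truncated_tail=False):
    """Parse JSONL lines into dicts; optionally ignore a broken final line."""
    lines = list(lines)
    rows = []
    for index, line in enumerate(lines):
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            is_last = index == len(lines) - 1
            if tolerate_truncated_tail and is_last:
                break  # assumed crash artifact in a .partial.jsonl file
            raise
    return rows


def load_rows(path, tolerate_truncated_tail=False):
    """Read a results file such as results/latest.jsonl."""
    text = Path(path).read_text(encoding="utf-8")
    return parse_rows(text.splitlines(), tolerate_truncated_tail)
```

Final result files should be loaded with the default strict mode so that a corrupt row fails validation instead of being dropped silently.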

Never commit raw real-agent transcripts, generated result JSONL, temporary workspaces, or large benchmark artifacts. Archive publishable raw rows outside the repo and link them from reports.

Subject Configuration

By default, git and oak_installed resolve from PATH. Optional local-build subjects are disabled until explicitly requested. Edit config/subjects.toml or pass overrides:

python3 scripts/bench.py \
  --subjects git,oak_installed,oak_local \
  --oak-local-bin ../oak/target/release/oak \
  --oak-repo ../oak

To benchmark exact main against local changes, build or place both binaries at stable paths and enable oak_main and oak_local, or pass the paths with CLI flags. The harness does not mutate the Oak source checkout.

Architecture

Shared measurement policy lives in scripts/oakbench/: subject identity, environment control, timed execution, token accounting and cost weights, remotes, run locks, row contracts, command semantics, stream adapters, fixture/config IO, runner identity, and reporting math.

Lane scripts stay thin over that core:

  • bench.py
  • workflow_ab.py
  • agent_workflow.py
  • parallel_contention.py
  • mount_probe.py
  • mount_vs_clone.py
  • platform_lifecycle.py
  • netshape_bench.py
  • task_loop.py

Domain vocabulary is in CONTEXT.md; load-bearing decisions are ADRs under docs/adr/.

Testing The Instruments

A benchmark that measures with untested instruments cannot make accuracy claims. The suite ships tests for row contracts, adapter streams, token accounting, command semantics, skip-row coherence, fixture registries, calibration math, oracles, and reporting:

python3 -m unittest discover -s tests

Run the test suite before pushing harness changes. For measurement-policy refactors, use scripts/row_parity.py to prove the new rows are measurement-identical before trusting the refactor.

python3 scripts/row_parity.py old/latest.jsonl new/latest.jsonl

Cloud And CI

Cloudflare is the control and publishing plane, not the latency measurement plane. Workers, D1, and R2 receive and index raw rows; dedicated runners execute benchmarks on pinned hardware.

See docs/cloud.md and docs/infrastructure-plan.md for the recommended setup: per-push smoke, nightly standard, dedicated large-file/monorepo runners, Linux shaped-network runners, macOS mount/agent runners, archived raw rows, and a small dashboard for Oak-vs-Git and Oak-vs-previous-Oak deltas.