AI Coding Agent Tests

Static scores are not enough. This page separates what AgentRanks currently estimates from what should be proven through repeatable coding tasks, terminal runs, hidden tests, cost tracking, and source-labeled benchmark evidence.

External coding evidence

Recent benchmark signals to watch. These are evidence inputs, not final AgentRanks test results.

#	Stack or model	Score	Test	Source	Status	Note
01	Codex + GPT-5.5	83.4%	Terminal-Bench v2.1	reported	evidence	Reported as leading Terminal-Bench v2.1 in June 2026 coverage; needs direct source verification before becoming an AgentRanks official score.
02	Claude Code + Fable 5	83.1%	Terminal-Bench v2.1	reported / unavailable	caution	Very strong reported score, but Fable 5 access was suspended by Anthropic on June 12, 2026.
03	Claude Fable 5	95.0%	SWE-bench Verified	third-party snapshot	evidence	Model-level coding signal, not an agent workflow score.
04	Claude Opus 4.8	88.6%	SWE-bench Verified	third-party snapshot	evidence	Useful model-level signal for Claude Code-style stacks.
05	GPT-5.5	82.6%	SWE-bench Verified	third-party snapshot	evidence	Useful model-level signal for Codex-style stacks.
06	Claude Fable 5	62.9%	Terminal-Bench Hard	third-party evaluation	evidence	Hard terminal task model benchmark; availability caveat applies.
07	GPT-5.5	60.6%	Terminal-Bench Hard	third-party evaluation	evidence	Hard terminal task model benchmark, useful as a model-side signal.

Current score audit

The existing ARscore is useful for discovery, but it is a proxy rather than a tested coding-ability score.

legacy_ARscore = agent_architecture_score * model_SWE_bench_percent / 100

Verdict: Useful for first-pass stack discovery, but not accurate enough to be treated as a real coding ability leaderboard.

#	Current top proxy stacks	Basis	ARscore	Agent architecture score	Model SWE-bench	Evidence status
1	Claude Code + Opus Mythos	Proxy only: Claude Code architecture score times Opus Mythos SWE-bench	80.8	86	93.9%	needs AgentRanks run
2	Claude Code + Opus 4.8	Proxy only: Claude Code architecture score times Opus 4.8 SWE-bench	76.2	86	88.6%	needs AgentRanks run
3	Claude Code + Opus 4.7	Proxy only: Claude Code architecture score times Opus 4.7 SWE-bench	75.3	86	87.6%	needs AgentRanks run
4	Codex + GPT-5.5	Proxy only: Codex architecture score times GPT-5.5 SWE-bench	69.4	84	82.6%	needs AgentRanks run
5	Claude Code + Sonnet 4.6	Proxy only: Claude Code architecture score times Sonnet 4.6 SWE-bench	68.5	86	79.6%	needs AgentRanks run
6	Codex + GPT-5.3 Codex	Proxy only: Codex architecture score times GPT-5.3 Codex SWE-bench	63.2	84	75.2%	needs AgentRanks run
7	Cursor + Opus 4.8	Proxy only: Cursor architecture score times Opus 4.8 SWE-bench	62.9	71	88.6%	needs AgentRanks run
8	Cursor + GPT-5.5	Proxy only: Cursor architecture score times GPT-5.5 SWE-bench	58.6	71	82.6%	needs AgentRanks run
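
The ARscore column can be recomputed from the agent score and the model SWE-bench percentage using the legacy formula stated earlier in this section. The sketch below does that for the eight rows in the table. The function name `legacy_arscore`, the `ROWS` list, and the 0.05 rounding tolerance are illustrative assumptions, not part of any AgentRanks tooling.

```python
# Minimal sketch: recompute the legacy proxy score for each table row.
# Names and the tolerance are hypothetical; the numbers come from the table.

def legacy_arscore(agent_architecture_score: float,
                   model_swe_bench_percent: float) -> float:
    """legacy_ARscore = agent_architecture_score * model_SWE_bench_percent / 100"""
    return agent_architecture_score * model_swe_bench_percent / 100


# (stack, agent architecture score, model SWE-bench %, published ARscore)
ROWS = [
    ("Claude Code + Opus Mythos", 86, 93.9, 80.8),
    ("Claude Code + Opus 4.8", 86, 88.6, 76.2),
    ("Claude Code + Opus 4.7", 86, 87.6, 75.3),
    ("Codex + GPT-5.5", 84, 82.6, 69.4),
    ("Claude Code + Sonnet 4.6", 86, 79.6, 68.5),
    ("Codex + GPT-5.3 Codex", 84, 75.2, 63.2),
    ("Cursor + Opus 4.8", 71, 88.6, 62.9),
    ("Cursor + GPT-5.5", 71, 82.6, 58.6),
]

# Collect any row whose published score differs from the recomputed one
# by more than the rounding tolerance (one decimal place).
mismatches = [
    stack
    for stack, agent, swe, published in ROWS
    if abs(legacy_arscore(agent, swe) - published) > 0.05
]

for stack, agent, swe, published in ROWS:
    print(f"{stack}: {legacy_arscore(agent, swe):.1f} (published {published})")
```

Because the score is a plain product of two inputs, two stacks that share a model differ only by the ratio of their agent scores. That is why it works for first-pass discovery but says nothing about how a stack behaves on an actual task.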

AgentRanks test pack

The product upgrade path: run identical coding tasks against every stack and publish pass rate, cost, time, retries, and code-quality notes.

Bug fix with hidden tests

Patch a real failing issue in a small repo, add regression coverage, and pass hidden tests.

30% weight

Terminal autonomy

Inspect a repo, run commands, diagnose failures, and produce a working fix without manual file hints.

20% weight

Feature build

Implement a small UI/API feature from product requirements with tests and no unrelated churn.

15% weight

Refactor safety

Refactor a shared module while preserving behavior and avoiding over-broad edits.

15% weight

Cost and latency

Measure tokens, wall time, retries, and cost per accepted solution.

10% weight

Maintainability review

Score code clarity, test quality, minimalism, and ease of future modification.

10% weight
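
The six weights above sum to 100%, so one way to combine them is a weighted average of per-category scores. The sketch below assumes each category has already been scored on a 0 to 100 scale; the page does not specify how raw measurements such as tokens or wall time are normalized. The category keys, function names, and sample numbers are hypothetical.

```python
from typing import Dict, Optional

# Weights in percent, taken from the test pack above.
# The dictionary keys are illustrative names, not AgentRanks identifiers.
WEIGHTS: Dict[str, int] = {
    "bug_fix_hidden_tests": 30,
    "terminal_autonomy": 20,
    "feature_build": 15,
    "refactor_safety": 15,
    "cost_and_latency": 10,
    "maintainability_review": 10,
}


def composite_score(category_scores: Dict[str, float]) -> float:
    """Weighted average of category scores, each assumed to be in 0..100."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS) / 100


def cost_per_accepted_solution(total_cost_usd: float,
                               accepted_solutions: int) -> Optional[float]:
    """Total spend divided by accepted solutions; None if nothing was accepted."""
    if accepted_solutions == 0:
        return None
    return total_cost_usd / accepted_solutions


# Hypothetical scores for one stack, for illustration only.
example_scores = {
    "bug_fix_hidden_tests": 80,
    "terminal_autonomy": 70,
    "feature_build": 60,
    "refactor_safety": 90,
    "cost_and_latency": 50,
    "maintainability_review": 75,
}

print(composite_score(example_scores))        # 73.0
print(cost_per_accepted_solution(12.0, 8))    # 1.5
```

Dividing cost by accepted solutions rather than by attempts charges a stack for its failed runs and retries, which matches the "cost per accepted solution" wording in the Cost and latency item.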

Benchmark sources

Primary and secondary sources that should feed the evidence layer.