
AI Coding Agent Tests

Static scores are not enough. This page separates what AgentRanks currently estimates from what should be proven through repeatable coding tasks, terminal runs, hidden tests, cost tracking, and source-labeled benchmark evidence.

External coding evidence

Recent benchmark signals to watch. These are evidence inputs, not final AgentRanks test results.

01 Codex + GPT-5.5
83.4% on Terminal-Bench v2.1. Source: reported. Status: evidence.
Reported as leading Terminal-Bench v2.1 in June 2026 coverage; needs direct source verification before becoming an AgentRanks official score.

02 Claude Code + Fable 5
83.1% on Terminal-Bench v2.1. Source: reported / unavailable. Status: caution.
Very strong reported score, but Fable 5 access was suspended by Anthropic on June 12, 2026.

03 Claude Fable 5
95.0% on SWE-bench Verified. Source: third-party snapshot. Status: evidence.
Model-level coding signal, not an agent workflow score.

04 Claude Opus 4.8
88.6% on SWE-bench Verified. Source: third-party snapshot. Status: evidence.
Useful model-level signal for Claude Code-style stacks.

05 GPT-5.5
82.6% on SWE-bench Verified. Source: third-party snapshot. Status: evidence.
Useful model-level signal for Codex-style stacks.

06 Claude Fable 5
62.9% on Terminal-Bench Hard. Source: third-party evaluation. Status: evidence.
Hard terminal task model benchmark; availability caveat applies.

07 GPT-5.5
60.6% on Terminal-Bench Hard. Source: third-party evaluation. Status: evidence.
Hard terminal task model benchmark, useful as a model-side signal.

Current score audit

The existing ARscore is useful for discovery, but it is a proxy rather than a tested coding-ability score.

legacy_ARscore = agent_architecture_score * model_SWE_bench_percent / 100
Verdict: Useful for first-pass stack discovery, but not accurate enough to be treated as a real coding ability leaderboard.

# | Current top proxy stacks | ARscore | Agent | Model SWE | Evidence status
--|--------------------------|---------|-------|-----------|----------------------
1 | Claude Code + Opus Mythos | 80.8 | 86 | 93.9% | needs AgentRanks run
2 | Claude Code + Opus 4.8 | 76.2 | 86 | 88.6% | needs AgentRanks run
3 | Claude Code + Opus 4.7 | 75.3 | 86 | 87.6% | needs AgentRanks run
4 | Codex + GPT-5.5 | 69.4 | 84 | 82.6% | needs AgentRanks run
5 | Claude Code + Sonnet 4.6 | 68.5 | 86 | 79.6% | needs AgentRanks run
6 | Codex + GPT-5.3 Codex | 63.2 | 84 | 75.2% | needs AgentRanks run
7 | Cursor + Opus 4.8 | 62.9 | 71 | 88.6% | needs AgentRanks run
8 | Cursor + GPT-5.5 | 58.6 | 71 | 82.6% | needs AgentRanks run

Every row is proxy only: the ARscore is the agent's architecture score times the model's SWE-bench percentage.
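
The proxy formula can be checked against the rows above. The following is a minimal Python sketch, not AgentRanks code: the function name `legacy_arscore` is hypothetical, and rounding to one decimal place is an assumption about how the page displays scores.

```python
def legacy_arscore(agent_architecture_score: float, model_swe_bench_percent: float) -> float:
    """Proxy score: agent architecture score scaled by the model's SWE-bench percentage."""
    return agent_architecture_score * model_swe_bench_percent / 100


# (stack, agent architecture score, model SWE-bench %) taken from the rows above.
rows = [
    ("Claude Code + Opus Mythos", 86, 93.9),
    ("Codex + GPT-5.5", 84, 82.6),
    ("Cursor + Opus 4.8", 71, 88.6),
]

for stack, agent, swe in rows:
    # One-decimal rounding is an assumption, not something the page states.
    print(f"{stack}: {round(legacy_arscore(agent, swe), 1)}")
```

Because the score is a plain product, two stacks that share a model differ only by the agent factor. Claude Code + Opus 4.8 and Cursor + Opus 4.8 use the same 88.6% model figure, so the gap between 76.2 and 62.9 comes entirely from the architecture scores of 86 and 71, neither of which is a measured task result.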

AgentRanks test pack

The product upgrade path: run identical coding tasks against every stack and publish pass rate, cost, time, retries, and code-quality notes.
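
As a rough illustration of what publishing those numbers could involve, the sketch below aggregates per-run records into pass rate, cost, time, and retries. The `RunResult` record shape, the field names, and the sample values are all hypothetical; code-quality notes are left out because they are not numeric.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One attempt by one stack on one task (hypothetical record shape)."""
    task_id: str
    passed: bool         # did the solution pass the hidden tests?
    cost_usd: float      # total spend for this attempt
    wall_seconds: float  # elapsed time for this attempt
    retries: int         # extra attempts before the final answer


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate one stack's runs into publishable metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    accepted = sum(1 for r in runs if r.passed)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "pass_rate": accepted / len(runs),
        "total_cost_usd": total_cost,
        # Failed runs still cost money, so divide total spend by accepted solutions.
        "cost_per_accepted_usd": total_cost / accepted if accepted else float("inf"),
        "mean_wall_seconds": sum(r.wall_seconds for r in runs) / len(runs),
        "mean_retries": sum(r.retries for r in runs) / len(runs),
    }


# Made-up sample data, for illustration only.
runs = [
    RunResult("bugfix-01", True, 0.40, 120.0, 0),
    RunResult("bugfix-02", False, 0.60, 300.0, 2),
    RunResult("feature-01", True, 1.00, 480.0, 1),
]
summary = summarize(runs)
print(summary)
```

Dividing total spend by accepted solutions, rather than by all runs, is a design choice in this sketch: it keeps a cheap but frequently failing stack from looking economical.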

Bug fix with hidden tests

Patch a real failing issue in a small repo, add regression coverage, and pass hidden tests.

30% weight

Terminal autonomy

Inspect a repo, run commands, diagnose failures, and produce a working fix without manual file hints.

20% weight

Feature build

Implement a small UI/API feature from product requirements with tests and no unrelated churn.

15% weight

Refactor safety

Refactor a shared module while preserving behavior and avoiding over-broad edits.

15% weight

Cost and latency

Measure tokens, wall time, retries, and cost per accepted solution.

10% weight

Maintainability review

Score code clarity, test quality, minimalism, and ease of future modification.

10% weight
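
The six weights above sum to 100%, so one plausible way to combine them is a weighted average. The sketch below assumes each category is first normalized to a 0-100 scale; that normalization, the key names, and the example scores are assumptions, since the page does not specify how categories are scored or combined.

```python
# Category weights as listed above; they sum to 100%.
WEIGHTS = {
    "bug_fix_hidden_tests": 0.30,
    "terminal_autonomy": 0.20,
    "feature_build": 0.15,
    "refactor_safety": 0.15,
    "cost_and_latency": 0.10,
    "maintainability_review": 0.10,
}


def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores, each assumed to be on a 0-100 scale."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)


# Made-up scores for a hypothetical stack, for illustration only.
example = {
    "bug_fix_hidden_tests": 80.0,
    "terminal_autonomy": 70.0,
    "feature_build": 60.0,
    "refactor_safety": 90.0,
    "cost_and_latency": 50.0,
    "maintainability_review": 75.0,
}
print(round(composite_score(example), 1))
```

Refusing to score a stack with a missing category, instead of treating the gap as zero, keeps partially tested stacks out of the ranking until every task type has been run.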

Benchmark sources

Primary and secondary sources that should feed the evidence layer.

SWE-bench Verified

Real GitHub issue resolution with tests; strong signal, but increasingly saturated and source quality varies by run.

official benchmark

Terminal-Bench 2.0

Closer to real coding-agent work because the system must inspect files, run commands, debug, and finish tasks.

official benchmark

Terminal-Bench Hard

Useful for separating frontier coding models on harder terminal tasks.

third-party evaluation

SWE-bench / Vals AI

Useful secondary snapshot with recent Fable 5, Opus 4.8, and GPT-5.5 figures, but should not replace primary benchmark links.

third-party evaluation