AI Coding Agent Tests
Static scores are not enough. This page separates what AgentRanks currently estimates from what should be proven through repeatable coding tasks, terminal runs, hidden tests, cost tracking, and source-labeled benchmark evidence.
External coding evidence
Recent benchmark signals to watch. These are evidence inputs, not final AgentRanks test results.
Current score audit
The existing ARscore is useful for discovery, but it is a proxy (the agent's architecture score multiplied by the model's SWE-bench score), not a tested coding-ability score.
| # | Current top proxy stacks | Basis | ARscore | Agent score | Model SWE-bench | Evidence status |
|---|---|---|---|---|---|---|
| 1 | Claude Code + Opus Mythos | Proxy only: Claude Code architecture score times Opus Mythos SWE-bench | 80.8 | 86 | 93.9% | needs AgentRanks run |
| 2 | Claude Code + Opus 4.8 | Proxy only: Claude Code architecture score times Opus 4.8 SWE-bench | 76.2 | 86 | 88.6% | needs AgentRanks run |
| 3 | Claude Code + Opus 4.7 | Proxy only: Claude Code architecture score times Opus 4.7 SWE-bench | 75.3 | 86 | 87.6% | needs AgentRanks run |
| 4 | Codex + GPT-5.5 | Proxy only: Codex architecture score times GPT-5.5 SWE-bench | 69.4 | 84 | 82.6% | needs AgentRanks run |
| 5 | Claude Code + Sonnet 4.6 | Proxy only: Claude Code architecture score times Sonnet 4.6 SWE-bench | 68.5 | 86 | 79.6% | needs AgentRanks run |
| 6 | Codex + GPT-5.3 Codex | Proxy only: Codex architecture score times GPT-5.3 Codex SWE-bench | 63.2 | 84 | 75.2% | needs AgentRanks run |
| 7 | Cursor + Opus 4.8 | Proxy only: Cursor architecture score times Opus 4.8 SWE-bench | 62.9 | 71 | 88.6% | needs AgentRanks run |
| 8 | Cursor + GPT-5.5 | Proxy only: Cursor architecture score times GPT-5.5 SWE-bench | 58.6 | 71 | 82.6% | needs AgentRanks run |
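The proxy arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, assuming the ARscore is the agent architecture score multiplied by the model's SWE-bench percentage and rounded to one decimal place. The rounding rule and the function name `proxy_arscore` are assumptions for illustration, not AgentRanks code.

```python
def proxy_arscore(agent_score: float, model_swe_pct: float) -> float:
    """Agent architecture score times the model's SWE-bench pass rate.

    model_swe_pct is a percentage (e.g. 93.9 for 93.9%).
    Rounding to one decimal is an assumption based on the table's precision.
    """
    return round(agent_score * model_swe_pct / 100, 1)


# Three rows copied from the table above: (stack, agent score, SWE-bench %).
rows = [
    ("Claude Code + Opus Mythos", 86, 93.9),
    ("Codex + GPT-5.5", 84, 82.6),
    ("Cursor + Opus 4.8", 71, 88.6),
]

for stack, agent, swe in rows:
    print(f"{stack}: {proxy_arscore(agent, swe)}")
```

Because the score is a product of two published inputs, it cannot reveal how a given agent and model actually perform together, which is why every row is still marked as needing an AgentRanks run.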
AgentRanks test pack
The product upgrade path: run identical coding tasks against every stack and publish pass rate, cost, time, retries, and code-quality notes.
Bug fix with hidden tests
Patch a real failing issue in a small repo, add regression coverage, and pass hidden tests.
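To illustrate why hidden tests matter, here is a small sketch of a grader that accepts a fix only if it passes both the cases the agent can see and a held-back set. The `clamp` task, the case lists, and the names `passes_all` and `grade` are all hypothetical; a real harness would run a repository's test suite rather than call a single function.

```python
from typing import Any, Callable, Sequence, Tuple

Case = Tuple[tuple, Any]  # (positional args, expected return value)


def passes_all(candidate: Callable[..., Any], cases: Sequence[Case]) -> bool:
    """True only if the candidate returns the expected value for every case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failure
    return True


def grade(candidate: Callable[..., Any],
          visible: Sequence[Case],
          hidden: Sequence[Case]) -> dict:
    """Accept a fix only when both the visible and the hidden cases pass."""
    visible_ok = passes_all(candidate, visible)
    hidden_ok = passes_all(candidate, hidden)
    return {"visible_pass": visible_ok,
            "hidden_pass": hidden_ok,
            "accepted": visible_ok and hidden_ok}


# Hypothetical task: repair clamp(x, lo, hi).
VISIBLE = [((5, 0, 10), 5), ((-1, 0, 10), 0)]   # shown to the agent
HIDDEN = [((11, 0, 10), 10), ((3, 3, 3), 3)]    # held back by the grader


def overfit_fix(x, lo, hi):
    return max(x, lo)  # handles only the lower bound the visible cases exercise


def real_fix(x, lo, hi):
    return min(max(x, lo), hi)


print(grade(overfit_fix, VISIBLE, HIDDEN))
print(grade(real_fix, VISIBLE, HIDDEN))
```

The overfit patch satisfies every visible case yet is rejected, which is the failure mode hidden tests are meant to expose.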
Terminal autonomy
Inspect a repo, run commands, diagnose failures, and produce a working fix without manual file hints.
Feature build
Implement a small UI/API feature from product requirements with tests and no unrelated churn.
Refactor safety
Refactor a shared module while preserving behavior and avoiding over-broad edits.
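One way to make "over-broad edits" measurable is to compare the files a patch touches against the directories the task allows. The sketch below assumes the harness already has a list of changed paths (for example from a diff) and a per-task allow-list. The function name `out_of_scope` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def out_of_scope(changed_files: Iterable[str],
                 allowed_dirs: Iterable[str]) -> List[str]:
    """Return changed files that fall outside every allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    flagged = []
    for name in changed_files:
        path = PurePosixPath(name)
        # A file is in scope if it is an allowed path itself
        # or sits somewhere underneath one.
        in_scope = any(path == a or a in path.parents for a in allowed)
        if not in_scope:
            flagged.append(name)
    return flagged


changed = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "README.md",
    "src/auth/session.py",
]
print(out_of_scope(changed, ["src/billing", "tests/billing"]))
```

Matching on path components rather than string prefixes keeps a sibling directory such as `src/billing_v2` from being treated as in scope.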
Cost and latency
Measure tokens, wall time, retries, and cost per accepted solution.
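As a rough sketch of how these measurements could be rolled up, the snippet below aggregates per-attempt records into a pass rate and a cost per accepted solution. The `Attempt` fields, the `summarize` function, and the sample numbers are illustrative assumptions, not an AgentRanks schema or real results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    tokens: int
    wall_seconds: float
    retries: int
    cost_usd: float
    accepted: bool  # True if the solution passed the task's acceptance checks


def summarize(attempts: List[Attempt]) -> dict:
    """Aggregate attempts for one stack into headline metrics."""
    n = len(attempts)
    accepted = sum(1 for a in attempts if a.accepted)
    total_cost = sum(a.cost_usd for a in attempts)
    # Failed attempts still cost money, so they stay in the numerator.
    cost_per_accepted: Optional[float] = (
        total_cost / accepted if accepted else None
    )
    return {
        "pass_rate": accepted / n if n else 0.0,
        "total_tokens": sum(a.tokens for a in attempts),
        "total_wall_seconds": sum(a.wall_seconds for a in attempts),
        "total_retries": sum(a.retries for a in attempts),
        "cost_per_accepted": cost_per_accepted,
    }


# Made-up sample data for illustration only.
sample = [
    Attempt(tokens=12000, wall_seconds=90.0, retries=0, cost_usd=0.50, accepted=True),
    Attempt(tokens=8000, wall_seconds=60.0, retries=2, cost_usd=0.25, accepted=False),
    Attempt(tokens=20000, wall_seconds=150.0, retries=1, cost_usd=0.75, accepted=True),
]
print(summarize(sample))
```

Dividing total spend by accepted solutions, rather than averaging the cost of successful runs alone, penalizes stacks that burn budget on attempts that never pass.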
Maintainability review
Score code clarity, test quality, minimalism, and ease of future modification.
Benchmark sources
Primary and secondary sources that should feed the evidence layer.
SWE-bench Verified
Real GitHub issue resolution with tests; a strong signal, but the benchmark is increasingly saturated and source quality varies by run.
Terminal-Bench 2.0
Closer to real coding-agent work because the system must inspect files, run commands, debug, and finish tasks.
Terminal-Bench Hard
Useful for separating frontier coding models on harder terminal tasks.
SWE-bench / Vals AI
Useful secondary snapshot with recent Fable 5, Opus 4.8, and GPT-5.5 figures, but should not replace primary benchmark links.