Score Audit

The current AgentRanks score is directionally useful, but it should not be marketed as a complete measure of real-world coding ability until repeatable task runs have been collected.

Current formula

Good for a first-pass leaderboard, weak as a proof of coding ability.

legacy_ARscore = agent_architecture_score * model_SWE_bench_percent / 100

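The formula multiplies a subjective agent architecture score by the model's SWE-bench percentage, so it is useful for first-pass stack discovery but not accurate enough to be treated as a real coding ability leaderboard.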
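As a rough illustration, the legacy formula can be written as a small Python function. This is a minimal sketch, not the AgentRanks implementation: the function name, the range check, and the sample inputs are hypothetical, and the 0 to 100 scale for the architecture score is an assumption the formula itself does not state.

```python
def legacy_ar_score(agent_architecture_score: float,
                    model_swe_bench_percent: float) -> float:
    """Legacy ARscore: architecture score scaled by the SWE-bench percentage."""
    # Assumption: the benchmark value is a percentage in [0, 100].
    if not 0.0 <= model_swe_bench_percent <= 100.0:
        raise ValueError("model_swe_bench_percent must be between 0 and 100")
    return agent_architecture_score * model_swe_bench_percent / 100


# Hypothetical inputs, chosen only to show the arithmetic.
print(legacy_ar_score(80.0, 50.0))  # 40.0
print(legacy_ar_score(50.0, 80.0))  # 40.0
```

Because the formula is a plain product, a strong agent on a weaker model and a weaker agent on a strong model can receive the same score, which is one reason it works better for discovery than for ranking.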

Accuracy problems

Where the current system can mislead users.

It mixes subjective agent architecture scores with model benchmark scores.
It does not include end-to-end terminal task completion.
It does not include cost per completed task, retry count, latency, or tool failure rate.
It does not distinguish among official, third-party, self-reported, and rumored sources.
It can over-rank unavailable models or models with strong benchmark numbers but weak product access.

Recommended formula

Use this only after real AgentRanks test-pack runs exist.

0.35 * real_task_pass + 0.20 * SWE_bench + 0.15 * terminal_bench + 0.10 * cost_efficiency + 0.10 * availability + 0.10 * maintainability
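The weighted sum above can be sketched in Python as follows. The component names and weights come from the formula; the function name, the assumption that every component is already normalized to a 0 to 100 scale, and the sample values are hypothetical and only serve to show the arithmetic.

```python
# Weights copied from the recommended formula; they sum to 1.0.
WEIGHTS = {
    "real_task_pass": 0.35,
    "SWE_bench": 0.20,
    "terminal_bench": 0.15,
    "cost_efficiency": 0.10,
    "availability": 0.10,
    "maintainability": 0.10,
}


def recommended_score(components: dict) -> float:
    """Weighted sum of component scores, each assumed to be in [0, 100]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        # Refuse to score a stack when any component has not been measured.
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 100.0:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical component values for a single stack.
example = {
    "real_task_pass": 60.0,
    "SWE_bench": 70.0,
    "terminal_bench": 40.0,
    "cost_efficiency": 80.0,
    "availability": 100.0,
    "maintainability": 50.0,
}
# 21 + 14 + 6 + 8 + 10 + 5 = 64
print(round(recommended_score(example), 2))  # 64.0
```

Raising an error on a missing component, rather than treating it as zero, keeps a stack without real task runs from receiving a score that looks complete.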

Bug fix with hidden tests (30%)

Patch a real failing issue in a small repo, add regression coverage, and pass hidden tests.

Terminal autonomy (20%)

Inspect a repo, run commands, diagnose failures, and produce a working fix without manual file hints.

Feature build (15%)

Implement a small UI/API feature from product requirements with tests and no unrelated churn.

Refactor safety (15%)

Refactor a shared module while preserving behavior and avoiding over-broad edits.

Cost and latency (10%)

Measure tokens, wall time, retries, and cost per accepted solution.

Maintainability review (10%)

Score code clarity, test quality, minimalism, and ease of future modification.
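One way to combine the six task categories and their percentages is a weighted average, sketched below. The percentages are taken from the list above; the dictionary keys, the function names, and the assumption that each category has already been reduced to a score between 0 and 1 are hypothetical. The helper for cost per accepted solution is likewise only one plausible reading of that metric.

```python
from typing import Optional

# Category percentages as listed above; the keys are hypothetical identifiers.
TEST_PACK_WEIGHTS = {
    "bug_fix_hidden_tests": 30,
    "terminal_autonomy": 20,
    "feature_build": 15,
    "refactor_safety": 15,
    "cost_and_latency": 10,
    "maintainability_review": 10,
}
assert sum(TEST_PACK_WEIGHTS.values()) == 100


def weighted_pack_score(category_scores: dict) -> float:
    """Combine per-category scores (assumed in [0, 1]) into a 0-100 score."""
    for name in TEST_PACK_WEIGHTS:
        if not 0.0 <= category_scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(TEST_PACK_WEIGHTS[n] * category_scores[n] for n in TEST_PACK_WEIGHTS)


def cost_per_accepted_solution(total_cost: float, accepted: int) -> Optional[float]:
    """Total spend divided by accepted solutions; None when nothing was accepted."""
    if accepted == 0:
        return None
    return total_cost / accepted


# Hypothetical results: half of the bug-fix tasks pass, everything else is perfect.
scores = {name: 1.0 for name in TEST_PACK_WEIGHTS}
scores["bug_fix_hidden_tests"] = 0.5
print(weighted_pack_score(scores))            # 85.0
print(cost_per_accepted_solution(12.0, 8))    # 1.5
```

Dividing by accepted solutions rather than by attempts means retries and failed runs raise the reported cost, which matches the intent of measuring cost per accepted solution.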

Implementation rule: Keep legacy ARscore visible as a proxy, and add Tested Score v2 only where AgentRanks has reproducible logs, task prompts, pass/fail outputs, runtime, token cost, and reviewer notes.
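The rule above amounts to an evidence gate, which could look roughly like the sketch below. The RunEvidence record, its field names, and both function names are hypothetical; the only thing taken from the rule is the list of required evidence and the requirement that the legacy score always stays visible.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RunEvidence:
    """Hypothetical record with one field per evidence item named in the rule."""
    reproducible_logs: bool
    task_prompt: str
    pass_fail_output: str
    runtime_seconds: Optional[float]
    token_cost: Optional[float]
    reviewer_notes: str


def has_full_evidence(evidence: RunEvidence) -> bool:
    """True only when every evidence item is present and non-empty."""
    return bool(
        evidence.reproducible_logs
        and evidence.task_prompt.strip()
        and evidence.pass_fail_output.strip()
        and evidence.runtime_seconds is not None
        and evidence.token_cost is not None
        and evidence.reviewer_notes.strip()
    )


def displayed_scores(legacy_ar_score: float,
                     tested_score_v2: Optional[float],
                     evidence: Optional[RunEvidence]) -> Dict[str, float]:
    """Always show the legacy proxy; add Tested Score v2 only with full evidence."""
    scores = {"legacy_ARscore": legacy_ar_score}
    if tested_score_v2 is not None and evidence is not None and has_full_evidence(evidence):
        scores["tested_score_v2"] = tested_score_v2
    return scores


# Hypothetical values: no evidence recorded, so only the legacy score appears.
print(displayed_scores(40.0, 64.0, None))  # {'legacy_ARscore': 40.0}
```

Keeping the gate in one function makes it harder for a tested score to appear on a listing whose runs cannot be reproduced.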