Score Audit
The current AgentRanks score is directionally useful, but it should not be marketed as a complete real-world coding ability score until repeatable task runs are collected.
Current formula
Good for a first-pass leaderboard, weak as a proof of coding ability.
Accuracy problems
Where the current system can mislead users.
- It mixes subjective agent architecture scores with model benchmark scores.
- It does not include end-to-end terminal task completion.
- It does not include cost per completed task, retry count, latency, or tool failure rate.
- It does not distinguish between official, third-party, self-reported, and rumored sources.
- It can over-rank models that are unavailable, or that have strong benchmark numbers but weak product access.
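One way to close several of these gaps is to record every test run with its operational metrics and a provenance label. The sketch below is a minimal illustration, not the AgentRanks schema: `SourceTier`, `RunRecord`, and all field names are hypothetical, chosen only to mirror the items in the list above.

```python
from dataclasses import dataclass
from enum import Enum


class SourceTier(Enum):
    """Provenance labels mirroring the four source types named above."""
    OFFICIAL = "official"
    THIRD_PARTY = "third-party"
    SELF_REPORTED = "self-reported"
    RUMOR = "rumor"


@dataclass(frozen=True)
class RunRecord:
    """One attempt by one agent at one task (hypothetical schema)."""
    agent: str
    task_id: str
    completed: bool      # did the agent finish the task end to end?
    cost_usd: float      # total spend for this attempt
    retries: int         # how many times the attempt was restarted
    latency_s: float     # wall-clock time in seconds
    tool_calls: int      # tool invocations made during the attempt
    tool_failures: int   # tool invocations that errored
    source: SourceTier   # where this result came from

    @property
    def tool_failure_rate(self) -> float:
        # Return 0.0 when no tools were called to avoid dividing by zero.
        return self.tool_failures / self.tool_calls if self.tool_calls else 0.0
```

Keeping `source` on each record means official and rumored results can be filtered or displayed separately instead of being blended into one number.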
Recommended formula
Use this only after real AgentRanks test-pack runs exist.
Bug fix with hidden tests
Patch a real failing issue in a small repo, add regression coverage, and pass hidden tests.
Terminal autonomy
Inspect a repo, run commands, diagnose failures, and produce a working fix without manual file hints.
Feature build
Implement a small UI/API feature from product requirements with tests and no unrelated churn.
Refactor safety
Refactor a shared module while preserving behavior and avoiding over-broad edits.
Cost and latency
Measure tokens, wall time, retries, and cost per accepted solution.
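As a rough illustration of the last metric, cost per accepted solution can be computed by dividing all spend, including spend on failed attempts, by the number of accepted solutions. The `Attempt` type and `summarize` function below are hypothetical names for this sketch, and counting failed attempts in the numerator is an assumption about how the metric should be defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """Measurements for a single attempt (hypothetical fields)."""
    tokens: int
    wall_time_s: float
    retries: int
    cost_usd: float
    accepted: bool


def summarize(attempts: list) -> dict:
    """Aggregate tokens, wall time, retries, and cost per accepted solution."""
    accepted = sum(1 for a in attempts if a.accepted)
    total_cost = sum(a.cost_usd for a in attempts)
    return {
        "attempts": len(attempts),
        "accepted": accepted,
        "total_tokens": sum(a.tokens for a in attempts),
        "total_wall_time_s": sum(a.wall_time_s for a in attempts),
        "total_retries": sum(a.retries for a in attempts),
        # Failed attempts still cost money, so all spend is divided by the
        # accepted count. None signals that nothing was accepted.
        "cost_per_accepted_usd": total_cost / accepted if accepted else None,
    }
```

Returning `None` when nothing was accepted avoids reporting a misleading zero cost for an agent that never produced a working solution.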
Maintainability review
Score code clarity, test quality, minimalism, and ease of future modification.
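The six categories above could be combined into one number with a weighted mean. The sketch below assumes each category is scored from 0 to 100 and takes the weights as an input, because this text does not state them. The category keys and the `composite_score` function are hypothetical names.

```python
# Category keys mirror the six headings above; the exact names are hypothetical.
CATEGORIES = (
    "bug_fix_hidden_tests",
    "terminal_autonomy",
    "feature_build",
    "refactor_safety",
    "cost_and_latency",
    "maintainability_review",
)


def composite_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-category scores, each assumed to be in [0, 100]."""
    # Refuse to score when a category has no data rather than treating it as zero.
    missing = [c for c in CATEGORIES if c not in scores or c not in weights]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    total_weight = sum(weights[c] for c in CATEGORIES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalizing by the total lets callers pass weights on any scale.
    return sum(scores[c] * weights[c] for c in CATEGORIES) / total_weight
```

Raising an error on a missing category, instead of silently scoring it as zero, matches the requirement that this formula be used only after real test-pack runs exist for every category.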