Each stack's ARscore = LLM SWE-bench % × Agent Architecture Score (rubric total) / 100. This combines the LLM's raw coding ability and the agent's architectural quality into a single score on a 0–100 scale.
Example: Claude Code (rubric 86) + Opus Mythos (SWE-bench 93.9%): 93.9 × 86 / 100 = 80.754, which rounds to an ARscore of 80.8.
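The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the function name `ar_score` and the range checks are assumptions, and the scaling (product divided by 100) is the one implied by the worked example, where 93.9% and a rubric total of 86 yield 80.8.

```python
def ar_score(swe_bench_pct: float, rubric_total: float) -> float:
    """Combine an LLM's SWE-bench percentage with an agent's rubric total.

    Both inputs are assumed to lie on a 0-100 scale, so the result does too.
    `ar_score` is a hypothetical helper name used only for this sketch.
    """
    if not 0 <= swe_bench_pct <= 100:
        raise ValueError("swe_bench_pct must be between 0 and 100")
    if not 0 <= rubric_total <= 100:
        raise ValueError("rubric_total must be between 0 and 100")
    # Scale implied by the worked example: 93.9 * 86 / 100 = 80.754
    return swe_bench_pct * rubric_total / 100


# Worked example from the text: Claude Code (rubric 86) + SWE-bench 93.9%
score = ar_score(93.9, 86)
print(round(score, 1))  # 80.8
```

Because the two factors are multiplied, a weak score on either side pulls the combined score down: a perfect rubric total cannot compensate for a low SWE-bench result, and vice versa.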
All agent architecture scores are evaluated on a 7-dimension rubric. See the full rubric →
We use SWE-bench Verified (pass@1 on 500 human-validated instances) as the primary LLM coding benchmark. Scores are sourced from official leaderboards and verified against multiple third-party sources. Learn more about SWE-bench →
Each agent is evaluated on 7 architectural dimensions (maximum 100 points in total): Multi-Agent Orchestration (20), Memory & Context (15), Tool System (20), Prompt Cache & Cost (10), Safety & Permissions (15), Reliability & Recovery (10), Community & Ecosystem (10).
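To show how the per-dimension caps add up to a rubric total, here is a small Python sketch. The dimension names and maximums come from the list above; the helper `rubric_total` and the sample scores are hypothetical and do not describe any real agent.

```python
# Maximum points per dimension, as listed in the rubric (sums to 100).
RUBRIC_MAX = {
    "Multi-Agent Orchestration": 20,
    "Memory & Context": 15,
    "Tool System": 20,
    "Prompt Cache & Cost": 10,
    "Safety & Permissions": 15,
    "Reliability & Recovery": 10,
    "Community & Ecosystem": 10,
}


def rubric_total(scores: dict) -> int:
    """Sum per-dimension scores after checking each against its cap.

    Hypothetical helper for illustration only.
    """
    missing = RUBRIC_MAX.keys() - scores.keys()
    unknown = scores.keys() - RUBRIC_MAX.keys()
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)}, unknown={sorted(unknown)}")
    for name, value in scores.items():
        if not 0 <= value <= RUBRIC_MAX[name]:
            raise ValueError(f"{name}: {value} is outside 0..{RUBRIC_MAX[name]}")
    return sum(scores.values())


# Made-up scores for an imaginary agent, not taken from the published data.
sample = {
    "Multi-Agent Orchestration": 15,
    "Memory & Context": 12,
    "Tool System": 16,
    "Prompt Cache & Cost": 7,
    "Safety & Permissions": 12,
    "Reliability & Recovery": 8,
    "Community & Ecosystem": 6,
}
print(rubric_total(sample))  # 76
```

The resulting total is the Agent Architecture Score that feeds into the ARscore formula.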
Open-source agents are scored from direct source code analysis. For proprietary agents without public source code, scores are estimated from published documentation and community analysis, with sources cited per score. A tolerance of ±1 point applies to all scores.
Removed from rubric: agents without sufficient public data (Claw Code, OpenClaude).