SWE-bench Verified
A benchmark that evaluates how well Large Language Models (LLMs) can autonomously resolve real-world software engineering problems drawn from GitHub issues.
Recent Performance Benchmarks
- gemini-3-flash achieved a score of 78%, outperforming both gemini-3-pro and Claude Sonnet 4.5.
Source: 2026 04 14 Matthew Berman Gemini Flash 3 and Nvidia Nemotron 3
Source Notes
- 2026-04-14: “But OpenClaw is expensive…”