SWE-bench Verified

A benchmark that evaluates whether Large Language Models (LLMs) can resolve real-world software engineering tasks by autonomously fixing GitHub issues: each task pairs an issue from a real open-source repository with the repository's test suite, and a fix counts only if the relevant tests pass afterward.
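The resolution criterion can be sketched roughly as follows. This is a hypothetical illustration, not the official SWE-bench harness: SWE-bench instances list tests that should flip from failing to passing (FAIL_TO_PASS) and tests that must keep passing (PASS_TO_PASS), and an issue counts as resolved only if both sets pass after the model's patch is applied.

```python
# Hypothetical sketch of the SWE-bench resolution check (not the official
# harness). `test_results` maps test id -> did it pass after applying the
# model-generated patch.

def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    # All previously failing tests must now pass...
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    # ...and no previously passing test may regress.
    no_regressions = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and no_regressions

# Example: the patch fixes the target test but breaks an existing one,
# so the issue does not count as resolved.
results = {"test_bugfix": True, "test_existing": False}
print(is_resolved(results, ["test_bugfix"], ["test_existing"]))  # False
```

The function and field names here are illustrative; the real harness applies the patch in a containerized checkout of the repository and parses the test runner's output.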

Recent Performance Benchmarks


Source: 2026-04-14 Matthew Berman, Gemini Flash 3 and Nvidia Nemotron 3

Source Notes

  • 2026-04-14: [[lab-notes/2026-04-14-Optimizing-AI-Costs-and-Privacy-with-Local-Open-Source-Models-and-Hybr|“But OpenClaw is expensive…”]]