SWE-bench

SWE-bench is a benchmark dataset designed to evaluate large language models on real-world software engineering tasks. Rather than relying on synthetic coding problems, it draws its 2,294 task instances from actual issues and pull requests in 12 open-source Python repositories on GitHub. Models must therefore work within existing codebases, understand project-specific context, and produce changes that integrate with established code structures and dependencies.

Evaluation Approach

SWE-bench presents a model with a real GitHub issue and the state of the corresponding repository before the issue was fixed. The model must locate the relevant code, understand the problem, and generate a patch. The patch is not compared textually with the merged pull request. Instead, it is applied to the repository and validated by running the tests associated with that pull request: tests that failed before the fix must now pass, and tests that already passed must continue to pass. This mirrors authentic software engineering workflows, in which developers diagnose problems within large, complex systems rather than solve isolated algorithmic puzzles.
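
The pass/fail criterion can be illustrated with a minimal sketch. It assumes a simplified task record and a test report given as a plain mapping from test name to status string. The `TaskInstance` class, the `is_resolved` function, and the example identifiers are hypothetical and are not part of the official evaluation harness. The field names are modeled on dataset fields such as `instance_id`, `base_commit`, `problem_statement`, `FAIL_TO_PASS`, and `PASS_TO_PASS`.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Illustrative subset of a task record (hypothetical class, not the official schema)."""
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str
    # Tests that fail before the fix and must pass after it.
    fail_to_pass: list[str] = field(default_factory=list)
    # Tests that pass before the fix and must keep passing.
    pass_to_pass: list[str] = field(default_factory=list)


def is_resolved(task: TaskInstance, test_status: dict[str, str]) -> bool:
    """Return True if every required test is reported as PASSED.

    `test_status` maps a test name to a status string, collected after
    the candidate patch has been applied. A test missing from the report
    counts as not passed.
    """
    # Assumption of this sketch: a task with no fail-to-pass tests
    # cannot demonstrate a fix, so it is never counted as resolved.
    if not task.fail_to_pass:
        return False
    required = task.fail_to_pass + task.pass_to_pass
    return all(test_status.get(name) == "PASSED" for name in required)


# Hypothetical example data, not taken from the real dataset.
task = TaskInstance(
    instance_id="example__repo-1",
    repo="example/repo",
    base_commit="0000000",
    problem_statement="Parser crashes on empty input.",
    fail_to_pass=["tests/test_parser.py::test_empty_input"],
    pass_to_pass=["tests/test_parser.py::test_basic"],
)

fixed = {
    "tests/test_parser.py::test_empty_input": "PASSED",
    "tests/test_parser.py::test_basic": "PASSED",
}
regressed = {
    "tests/test_parser.py::test_empty_input": "PASSED",
    "tests/test_parser.py::test_basic": "FAILED",
}

print(is_resolved(task, fixed))
print(is_resolved(task, regressed))
```

Under this criterion, a patch that fixes the reported bug but breaks an unrelated, previously passing test is not counted as a solution, which is why the second report above is rejected.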

SWE-bench Verified

SWE-bench Verified is a curated subset of 500 task instances from the original benchmark, each of which has undergone additional human review. The review filters out instances with ambiguous issue descriptions or problematic tests, so the remaining instances provide a more reliable basis for assessing model performance on realistic software engineering tasks.
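
Scores on a curated subset are typically reported as the percentage of its instances that a model resolves. The sketch below shows that arithmetic, assuming per-instance outcomes are already available as booleans. The `resolve_rate` function and the instance identifiers are hypothetical, and counting unattempted instances as unresolved is an assumption of this sketch.

```python
def resolve_rate(resolved_by_id: dict[str, bool], subset_ids: set[str]) -> float:
    """Percentage of instances in `subset_ids` that were resolved.

    Instances in the subset with no recorded outcome count as unresolved
    (an assumption of this sketch).
    """
    if not subset_ids:
        raise ValueError("subset_ids must not be empty")
    resolved = sum(1 for iid in subset_ids if resolved_by_id.get(iid, False))
    return 100.0 * resolved / len(subset_ids)


# Hypothetical outcomes for three instances of a full benchmark run.
outcomes = {"proj-1": True, "proj-2": False, "proj-3": True}

# Hypothetical curated subset that keeps two instances and adds one
# that the model never attempted.
curated = {"proj-1", "proj-2", "proj-4"}

rate = resolve_rate(outcomes, curated)
print(f"{rate:.1f}% resolved")
```

Because the denominator is the size of the subset rather than the number of attempted instances, skipping hard instances cannot inflate the score.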

Source Notes