NemoClaw Knowledge Wiki

Apr 18, 2026

benchmarks
software-engineering
LLM-evaluation
benchmarking
github-issues
autonomous-agents

SWE-bench Verified

SWE-bench Verified is a benchmark that evaluates how well Large Language Models (LLMs) can autonomously resolve real-world software engineering issues drawn from GitHub.

Recent Performance Benchmarks

According to the source below, gemini-3-flash scored 78% on SWE-bench Verified, outperforming both gemini-3-pro and Claude Sonnet 4.5.

Source: 2026-04-14, Mathew Berman, Gemini Flash 3 and Nvidia Nematron 3
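
For context, a percentage score on this benchmark is usually read as the share of tasks counted as resolved. The sketch below assumes the usual SWE-bench convention: a task is resolved only if the tests tied to the issue pass after the model's patch is applied and the tests that passed before still pass. The `InstanceResult` type, its field names, and the toy data are hypothetical; none of them come from the source or from the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of one benchmark task (hypothetical field names)."""
    instance_id: str
    fail_to_pass_ok: bool  # tests tied to the issue pass after the patch
    pass_to_pass_ok: bool  # tests that passed before the patch still pass


def is_resolved(result: InstanceResult) -> bool:
    """A task counts as resolved only if both test groups succeed."""
    return result.fail_to_pass_ok and result.pass_to_pass_ok


def resolve_rate(results: list[InstanceResult]) -> float:
    """Return the percentage of tasks that were resolved."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Toy data for illustration only, not real benchmark results.
results = [
    InstanceResult("demo-1", True, True),
    InstanceResult("demo-2", True, False),   # fixed the issue, broke other tests
    InstanceResult("demo-3", False, True),   # issue not fixed
    InstanceResult("demo-4", True, True),
]

print(f"{resolve_rate(results):.1f}%")  # prints 50.0%
```

SWE-bench Verified contains 500 tasks, so if a reported score covers the full set, 78% would correspond to 390 resolved tasks.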

Source Notes

2026-04-14: [[lab-notes/2026-04-14-Optimizing-AI-Costs-and-Privacy-with-Local-Open-Source-Models-and-Hybr|“But OpenClaw is expensive…”]]
