Coding benchmarks
Metrics and frameworks used to evaluate the proficiency of large language models in software engineering tasks, including code generation, debugging, and repository-level problem-solving.
Key Benchmarks
- swe-bench-verified: A human-validated subset of SWE-bench that tests whether a model can resolve real GitHub issues from open-source repositories.
  - Recent performance: gemini-3-flash scored 78%, outperforming both gemini-3-pro and Claude Sonnet 4.5.
- mistral-3-large: A 675B-parameter MoE model (Apache 2.0) benchmarked against deepseek-v3 and kimi-k2.
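A score such as the 78% reported for swe-bench-verified is usually read as a resolve rate: the share of benchmark issues for which the model's patch passes the tests. The snippet below is a minimal sketch of that arithmetic, assuming per-instance pass/fail outcomes are already available. The function name `resolve_rate` and the instance ids are hypothetical and not part of any official harness.

```python
def resolve_rate(results: dict[str, bool]) -> float:
    """Return the percentage of task instances marked as resolved."""
    if not results:
        raise ValueError("no results to score")
    # True counts as 1 and False as 0, so sum() gives the number resolved.
    return 100.0 * sum(results.values()) / len(results)


# Hypothetical per-instance outcomes (instance id -> resolved?).
example = {"task-1": True, "task-2": True, "task-3": False, "task-4": True}
print(f"{resolve_rate(example):.1f}%")  # prints "75.0%"
```

Because the score is a simple ratio, small differences between models are only meaningful when both were run on the same instance set.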
Sources
- 2026 04 14 Matthew Berman Gemini Flash 3 and Nvidia Nemotron 3
- 2026 04 14 Mistral latest model
Source Notes
- 2026-04-14: “But OpenClaw is expensive…”
- 2026-04-07: Meta Harness AI Self Evolution via Autonomous LLM Harness Optimization
- 2026-04-09: Anthropic Claude Mythos AI Security and Performance Breakthroughs for
- 2026-04-10: Qwen 36 Plus Open Source AIs Agentic Capabilities and Frontier
- 2026-04-12: MiniMax M27 Open Source LLM Technical Overview and Deployment Summary
- 2026-04-18: Anthropic Claude Opus 47 Agentic Coding Multimodal and Memory Advancem
- 2026-04-22: Google Gemma
- 2026-04-24: OpenAI GPT-5
- 2026-04-26: DeepSeek V4: China
- 2026-05-01: Alibaba Qwen 3.6 27B: Advanced Local Agentic Coding and Multimodal AI Capabilities