Performance of Open-Source LLMs on Coding



https://www.youtube.com/watch?v=xRnK2IFI31E

The video compares several leading AI models, including Qwen3, Kimi K2, Claude Opus 4, and Deepseek-V3-0324, showing how they perform across benchmarks and practical tasks. The speaker aims to highlight the strengths and weaknesses of each model, with particular focus on the recently updated Qwen3.

Qwen3 and Its Hybrid Reasoning Model:

Qwen3, developed by Alibaba, initially shipped with a “hybrid thinking mode” that let the model either reason step-by-step through complex problems (Thinking Mode) or answer simpler queries quickly (Non-Thinking Mode). Qwen has since shifted strategy, releasing separate Instruct and Thinking models to maximize quality. Qwen3-235B-A22B-Instruct-2507 is the updated “non-thinking” variant, optimized for instruction following, logical reasoning, mathematics, science, coding, and tool usage, and it features a 256K context window.

Benchmark Comparisons:

The video presents a bar chart comparing Qwen3-235B-A22B-Instruct-2507 (referred to as Qwen3 Instruct-2507 in the chart) against Qwen3-235B-A22B (Non-thinking), Kimi K2, Claude Opus 4 (Non-thinking), and Deepseek-V3-0324. Key benchmark highlights include:

  • Overall Performance: Among the non-thinking models shown in the chart, Qwen3 Instruct-2507 consistently comes out on top across the benchmarks.
  • GPQA (Knowledge): Qwen3 Instruct-2507 scores 77.5, surpassing Claude Opus 4 (Non-thinking) at 74.9 and Kimi K2 at 75.1.
  • AIME25 (Mathematics): Qwen3 Instruct-2507 leads with 70.3, significantly higher than Deepseek-V3-0324 (46.6) and GPT-4o-0327 (26.7).
  • LiveCodeBench v6 (Coding): Qwen3 Instruct-2507 achieves 51.8, while Kimi K2 scores 48.9 and Claude Opus 4 (Non-thinking) gets 44.6.
  • Arena-Hard v2 (Human Preference Alignment): Qwen3 Instruct-2507 scores 79.2, outperforming Kimi K2 (66.1) and Claude Opus 4 (Non-thinking) (51.5).
  • BFCL-v3 (Agent Capability): Qwen3 Instruct-2507 reaches 70.9, ahead of Kimi K2 (65.2) and Claude Opus 4 (Non-thinking) (64.7).
  • ARC-AGI (Reasoning): The video specifically highlights Qwen3 Instruct-2507’s 41.8, which the speaker notes is impressive for a non-reasoning model and higher than Claude Opus 4’s 30.3 with thinking disabled.
  • MultiPL-E (Coding): Qwen3 Instruct-2507 scores 87.9, outperforming Deepseek-V3-0324 (82.2) and OpenAI’s GPT-4o-0327 (82.7), though it narrowly trails Claude Opus 4 (Non-thinking) at 88.5.

Practical Demonstrations:

  1. Legendary Pokémon Encyclopedia: Prompt: Create a simple encyclopedia of the first 25 legendary Pokémon with types, lore snippets, and images, as a single HTML file. Qwen3 (Non-thinking): Generated the website structure and text but failed to properly include images. Kimi K2: Successfully created a visually appealing website with images and even added a disclaimer at the bottom.
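
A minimal sketch of the kind of single-file page this prompt asks for, in plain HTML/JavaScript: a data array rendered into cards with name, types, a lore snippet, and an image. This is an illustrative reconstruction, not any model's actual output; the two sample entries and the PokeAPI sprite URL pattern are assumptions.

```html
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Legendary Pokémon Encyclopedia</title></head>
<body>
<div id="dex"></div>
<script>
  // Two sample entries; a full answer would list the first 25 legendaries.
  const legendaries = [
    { id: 144, name: "Articuno", types: ["Ice", "Flying"],
      lore: "A legendary bird said to appear to people lost in icy mountains." },
    { id: 150, name: "Mewtwo", types: ["Psychic"],
      lore: "Created by gene manipulation, its heart grew savage." },
    // ...remaining entries...
  ];
  const dex = document.getElementById("dex");
  for (const p of legendaries) {
    const card = document.createElement("div");
    card.innerHTML = `
      <h2>#${p.id} ${p.name}</h2>
      <img src="https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/${p.id}.png"
           alt="${p.name}" width="96" height="96">
      <p><strong>Types:</strong> ${p.types.join(" / ")}</p>
      <p>${p.lore}</p>`;
    dex.appendChild(card);
  }
</script>
</body>
</html>
```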

  2. Bouncing Balls in a Heptagon: Prompt: Write an HTML program that shows 20 balls bouncing inside a spinning heptagon, including detailed physics like gravity, friction, and impact bounce height. Qwen3 (Non-thinking): Successfully produced a simulation with 20 balls bouncing realistically within a spinning heptagon. The speaker noted its impressive performance in tracking multiple balls, outperforming even some larger models. Kimi K2: Also produced a visually impressive simulation with realistic ball scattering.
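
The physics this prompt exercises reduces to a few components: per-frame gravity, a polygon that rotates each frame, signed-distance collision against each edge, and a damped reflection for the bounce. A minimal canvas 2D sketch under those assumptions follows; it is not any model's actual code, friction is approximated as simple air drag, and the wall's own motion is ignored at impact.

```html
<!DOCTYPE html>
<html>
<body>
<canvas id="c" width="600" height="600"></canvas>
<script>
  const ctx = document.getElementById("c").getContext("2d");
  const CX = 300, CY = 300, R = 250, N = 7;        // heptagon center, radius, sides
  const GRAVITY = 0.15, RESTITUTION = 0.85;        // impact bounce damping
  const AIR = 0.999, SPIN = 0.005;                 // crude friction, spin rate
  let angle = 0;

  // 20 balls scattered near the center with random velocities.
  const balls = Array.from({ length: 20 }, () => ({
    x: CX + (Math.random() - 0.5) * 100,
    y: CY + (Math.random() - 0.5) * 100,
    vx: (Math.random() - 0.5) * 4,
    vy: (Math.random() - 0.5) * 4,
    r: 8,
  }));

  const vertices = () => Array.from({ length: N }, (_, i) => {
    const a = angle + (i / N) * 2 * Math.PI;
    return [CX + R * Math.cos(a), CY + R * Math.sin(a)];
  });

  function step() {
    angle += SPIN;                                 // spin the heptagon
    const vs = vertices();
    for (const b of balls) {
      b.vy += GRAVITY;
      b.vx *= AIR; b.vy *= AIR;                    // friction as simple air drag
      b.x += b.vx; b.y += b.vy;
      for (let i = 0; i < N; i++) {                // test every edge of the polygon
        const [x1, y1] = vs[i], [x2, y2] = vs[(i + 1) % N];
        let nx = -(y2 - y1), ny = x2 - x1;         // edge normal...
        const len = Math.hypot(nx, ny); nx /= len; ny /= len;
        if ((CX - x1) * nx + (CY - y1) * ny < 0) { nx = -nx; ny = -ny; } // ...inward
        const dist = (b.x - x1) * nx + (b.y - y1) * ny;  // signed distance to edge
        if (dist < b.r) {                          // ball overlaps the wall
          b.x += (b.r - dist) * nx; b.y += (b.r - dist) * ny;  // push back inside
          const vn = b.vx * nx + b.vy * ny;
          if (vn < 0) {                            // moving into the wall: damped reflect
            b.vx -= (1 + RESTITUTION) * vn * nx;
            b.vy -= (1 + RESTITUTION) * vn * ny;
          }
        }
      }
    }
    ctx.clearRect(0, 0, 600, 600);
    ctx.beginPath();
    vs.forEach(([x, y], i) => i ? ctx.lineTo(x, y) : ctx.moveTo(x, y));
    ctx.closePath(); ctx.stroke();
    for (const b of balls) {
      ctx.beginPath(); ctx.arc(b.x, b.y, b.r, 0, 2 * Math.PI); ctx.fill();
    }
    requestAnimationFrame(step);
  }
  step();
</script>
</body>
</html>
```

Because the heptagon is convex and the balls stay inside, testing against each edge's infinite line (rather than the segment) is sufficient, which keeps the collision code short.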

  3. Procedural 3D Planet Generation: Prompt: Create a realistic, procedurally generated 3D planet in Three.js, including detailed terrain, biomes, atmospheric effects, lighting, water, clouds, rotation, and camera interaction. Qwen3 (Non-thinking): Generated a planet where the main landmasses appeared static, while the inner core rotated, which was not the desired behavior. Kimi K2: Produced a visually appealing planet with proper rotation, including an atmospheric layer and realistic shadow formation. Claude Opus 4 (Thinking Enabled): Demonstrated a more responsive and accurate output, adhering more closely to the complex prompt specifications, including proper rotation and shadow casting.
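
A common way to get the static-landmass bug the video shows is to rotate one object (a core or a texture layer) while leaving the displaced terrain mesh fixed; rotating the displaced mesh itself carries the terrain with it. A minimal Three.js sketch of that idea, using layered sines as a cheap stand-in for real 3D noise; the CDN URL, noise function, and parameters are all illustrative assumptions.

```html
<script type="module">
  import * as THREE from "https://unpkg.com/three@0.160.0/build/three.module.js";

  const scene = new THREE.Scene();
  const camera = new THREE.PerspectiveCamera(60, innerWidth / innerHeight, 0.1, 100);
  camera.position.z = 3;
  const renderer = new THREE.WebGLRenderer({ antialias: true });
  renderer.setSize(innerWidth, innerHeight);
  document.body.appendChild(renderer.domElement);

  // Cheap stand-in for real 3D noise (e.g. simplex): layered sines.
  const noise = (x, y, z) =>
    0.5 * Math.sin(3 * x + 2 * y) + 0.3 * Math.sin(5 * y - 4 * z)
    + 0.2 * Math.sin(7 * z + 3 * x);

  // Displace each sphere vertex radially: the displacement IS the terrain.
  const geo = new THREE.SphereGeometry(1, 128, 128);
  const pos = geo.attributes.position;
  const v = new THREE.Vector3();
  for (let i = 0; i < pos.count; i++) {
    v.fromBufferAttribute(pos, i);
    const h = 1 + 0.05 * noise(v.x, v.y, v.z);
    pos.setXYZ(i, v.x * h, v.y * h, v.z * h);
  }
  geo.computeVertexNormals();

  const planet = new THREE.Mesh(
    geo, new THREE.MeshStandardMaterial({ color: 0x3a7d3a, flatShading: true }));
  scene.add(planet);
  scene.add(new THREE.DirectionalLight(0xffffff, 1).translateX(5));
  scene.add(new THREE.AmbientLight(0x404040));

  renderer.setAnimationLoop(() => {
    planet.rotation.y += 0.002;  // rotate the displaced mesh: terrain moves too
    renderer.render(scene, camera);
  });
</script>
```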

  4. Maze Solving (Reasoning Test): Prompt: Solve a 10x10 ASCII maze from A1 to J10, providing a comma-separated list of cell coordinates, moving only through adjacent cells without crossing walls. Qwen3 (Non-thinking): Exhibited an extensive self-dialogue, attempting to interpret the maze structure and possible paths. However, despite this detailed “thought process,” the model ultimately failed to find a correct and valid path. Kimi K2: Showcased similar detailed internal reasoning and even “backtracking” in its thought process, but also failed to produce a correct and valid solution. Claude Opus 4 (Thinking Enabled): Successfully solved the maze by employing its “extended thinking” capabilities and utilizing a code interpreter tool. This highlights the crucial role of tools in complex reasoning tasks for AI models.
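
What the code-interpreter route buys Claude here is that the maze becomes a trivial graph search once it lives in a program: breadth-first search over 100 cells is instant, whereas models tracking walls purely in-context kept losing state. A hedged sketch of such a solver; the one-character-per-cell encoding ('#' = wall) is an assumption, since the video's actual ASCII maze, with walls drawn between cells, would need a richer parser.

```js
// BFS a 10x10 maze from A1 (row 0, col 0) to J10 (row 9, col 9).
// rows: array of 10 strings, one character per cell, '#' = wall.
function solveMaze(rows) {
  const H = rows.length, W = rows[0].length;
  const key = (r, c) => r * W + c;
  const prev = new Map([[key(0, 0), null]]);  // visited set + parent links
  const queue = [[0, 0]];
  while (queue.length) {
    const [r, c] = queue.shift();
    if (r === H - 1 && c === W - 1) {         // reached J10: rebuild the path
      const path = [];
      for (let k = key(r, c); k !== null; k = prev.get(k))
        path.unshift("ABCDEFGHIJ"[Math.floor(k / W)] + (k % W + 1));
      return path.join(",");                  // e.g. "A1,B1,B2,...,J10"
    }
    for (const [dr, dc] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
      const nr = r + dr, nc = c + dc;
      if (nr >= 0 && nr < H && nc >= 0 && nc < W &&
          rows[nr][nc] !== "#" && !prev.has(key(nr, nc))) {
        prev.set(key(nr, nc), key(r, c));
        queue.push([nr, nc]);
      }
    }
  }
  return null;                                // no valid path exists
}
```

Called on the parsed grid, this returns exactly the comma-separated coordinate list the prompt demands, or null when the maze is unsolvable, which is why offloading the task to a tool is so much more reliable than in-context reasoning.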

Conclusion: While Qwen3 Instruct-2507 posts impressive numbers across many benchmarks, particularly among non-thinking models, the practical tests reveal that models with robust tool-use capabilities, such as Claude Opus 4 with thinking enabled, still excel at complex reasoning and physics-based tasks where iterative problem-solving and external tools pay off. The video suggests that offering dedicated models for different kinds of tasks (reasoning vs. non-reasoning), or integrating tools for complex problems, may be a strategic direction for AI development.