Gemini Ultra 2 Scores 89% on SWE-bench — What That Number Actually Means

Google's model leads the coding benchmark. We explain what SWE-bench actually tests, where Gemini Ultra 2 excels and where it struggles, and how to decide whether it's right for your workflow.

Google DeepMind·Tuesday, June 2, 2026 at 12:00 PM·6 min read

Google's Gemini Ultra 2 scored 89.3% on SWE-bench Verified, the current standard benchmark for AI coding capability. For comparison, OpenAI's GPT-5 scores 87.1% and Anthropic's Claude 4 scores 84.6% on the same benchmark. That puts Gemini Ultra 2 ahead by 2.2 points over GPT-5 and 4.7 points over Claude 4: a real gap, but smaller than the marketing treatment suggests. Understanding what SWE-bench actually tests is essential for deciding whether the result should influence which model you use.

SWE-bench Verified is not a coding trivia test. It consists of real GitHub issues drawn from popular open-source repositories including Django, scikit-learn, and SymPy. Each task requires the model to read the issue description, navigate the existing codebase, and write a patch that passes the repository's test suite. Scoring 89.3% means Gemini Ultra 2 autonomously resolves about 89 out of every 100 of these well-defined, real-world bugs.
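To make the percentage concrete, here is a minimal sketch of how a resolve rate like this could be tallied. It is not the official SWE-bench harness: the `TaskResult` record, its field names, and the sample task IDs are all hypothetical, and the only rule it encodes is the one described above, that a task counts when the model's patch applies and the tests pass.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical record layout)."""
    task_id: str
    patch_applied: bool  # did the model's patch apply cleanly?
    tests_passed: bool   # did the task's tests pass afterwards?


def is_resolved(result: TaskResult) -> bool:
    # A task counts only if the patch applied AND the tests passed.
    return result.patch_applied and result.tests_passed


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Illustrative data only, not real benchmark output.
sample = [
    TaskResult("django-001", True, True),
    TaskResult("sympy-002", True, False),
    TaskResult("sklearn-003", False, False),
    TaskResult("django-004", True, True),
]
print(f"{resolve_rate(sample):.1f}% resolved")  # prints "50.0% resolved"
```

The point of the sketch is that the score is all-or-nothing per task: a patch that fixes most of a bug but fails one test contributes nothing to the percentage.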

Gemini Ultra 2 performs best at large codebase navigation, multi-file edits that maintain consistency across a repository, and linking issue descriptions to root-cause bugs in unfamiliar frameworks. Google attributes this to architecture improvements in how the model attends to long code context. Early reports from professional users support the claim, particularly for tasks in large Python monorepos where the relevant code spans many modules.

The benchmark doesn't capture several capabilities that matter for daily development. Novel architecture decisions, tasks that require understanding unwritten business logic, and integration with proprietary internal APIs the model hasn't encountered during training all remain beyond what any current model does reliably. Benchmarks test defined problems; real work is mostly undefined.

Gemini Ultra 2 is available through Google AI Studio (with a free tier), Vertex AI for enterprise deployments, and as the backend for Gemini Code Assist in VS Code and JetBrains IDEs. The IDE integration is the most accessible starting point for developers who want to test it against their actual workflow without setting up API access.

The practical recommendation: benchmark results are a guide to which tool to try, not which tool to commit to. The best model for your specific codebase depends on your primary language, repository size, and typical task type. Running five of your own representative bugs through Gemini Ultra 2, GPT-5, and Claude 4 will tell you more about fit than any published leaderboard.
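A trial like this can be tracked with a few lines of code. The sketch below assumes you record one pass/fail outcome per bug per model by hand; the model labels (`model_a` and so on) and the outcomes are placeholders, not measurements of any real model.

```python
# Hypothetical trial log: one True/False per bug, in the same bug order
# for every model. Replace the placeholders with your own results.
trial = {
    "model_a": [True, True, False, True, True],
    "model_b": [True, False, False, True, True],
    "model_c": [True, True, True, False, True],
}


def summarize(outcomes: dict[str, list[bool]]) -> dict[str, tuple[int, int]]:
    """Map each model to (bugs resolved, bugs attempted)."""
    return {model: (sum(results), len(results))
            for model, results in outcomes.items()}


for model, (resolved, attempted) in summarize(trial).items():
    rate = 100 * resolved / attempted
    print(f"{model}: {resolved}/{attempted} resolved ({rate:.0f}%)")
```

With five bugs, each one shifts a model's rate by 20 percentage points, so the tally is best read alongside your notes on which bugs failed and why.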

Source

Google DeepMind

Key Takeaway

An 89% SWE-bench score means Gemini Ultra 2 can autonomously fix roughly nine out of ten well-specified bugs in popular open-source projects. That is impressive, but real-world gains depend heavily on how clearly the problem is defined. Before committing to any single AI coding tool based on benchmarks, run your own test on five representative tasks from your actual codebase. The gap between models can look very different once you move from benchmark conditions to real work.
