Long-Horizon SWE Benchmark

Measuring frontier coding agents on original, long-horizon engineering tasks

DeepSWE is a contamination-free benchmark built to separate today's leading coding agents. Tasks are written from scratch across 91 repositories and 5 languages, with hand-written verifiers that test software behavior rather than implementation details.


91 Repositories

5 Languages

113 Tasks

5.5× More Code per Task

Why DeepSWE

Contamination Free

Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.

High Diversity

Tasks span a broad pool of 91 repositories across 5 languages, covering real-world software engineering scenarios.

Real-World Complexity

Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5× more code and ~2× more output tokens.

Reliable Verification

Verifiers are hand-written to test software behavior rather than implementation details, ensuring accurate evaluation.

Leaderboard

Showing 12 of 16 models

#    Model                       Score
1    gpt-5.5 xhigh               70% ±4%
2    gpt-5.4 xhigh               56% ±5%
3    claude-opus-4.7 max         54% ±5%
4    claude-sonnet-4.6 high      32% ±4%
5    gemini-3.5-flash medium     28% ±4%
6    gpt-5.4-mini xhigh          24% ±4%
7    kimi-k2.6                   24% ±4%
8    mimo-v2.5-pro               19% ±4%
9    glm-5.1                     18% ±4%
10   gemini-3.1-pro              10% ±3%
11   deepseek-v4-pro             8% ±2%
12   gemini-3-flash              5% ±2%

All models are run with mini-swe-agent
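The page does not say how the ± margins are computed. As an illustration only, the sketch below assumes one common choice: the binomial standard error of a pass rate measured over the 113 tasks. The function name and the choice of formula are assumptions, not the benchmark's documented method.

```python
import math


def binomial_standard_error(pass_rate: float, num_tasks: int) -> float:
    """Standard error of a pass rate measured over num_tasks independent tasks."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / num_tasks)


NUM_TASKS = 113  # task count stated on this page

# Scores are copied from the leaderboard; the formula is an assumption.
for model, score in [("gpt-5.5 xhigh", 0.70), ("gemini-3-flash", 0.05)]:
    margin = binomial_standard_error(score, NUM_TASKS)
    print(f"{model}: {score:.0%} ±{margin:.0%}")
```

For these two rows the assumed formula gives margins of about 4% and 2%, the same size as the listed values. Because the published scores are rounded and the actual method is not stated, this is a plausibility check rather than a reproduction.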

How DeepSWE Works

Task Authorship

Each task is written from scratch by expert engineers. Because no task is adapted from an existing commit or PR, its solution cannot already appear in a model's training data.

Multi-Repository Sampling

Tasks are drawn from 91 diverse repositories across TypeScript, Go, Rust, JavaScript, and Python, covering real-world codebases.

Agent Execution

Frontier coding agents receive the task prompt and generate a solution. Prompts are roughly half the length of SWE-bench Pro's, while solutions require 5.5× more code.

Behavioral Verification

Hand-written verifiers test whether the software behaves correctly, independent of implementation specifics, ensuring reliable scoring.
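To make the distinction concrete, here is a minimal sketch of a behavioral check. The task, function names, and test cases are hypothetical and are not taken from DeepSWE. The verifier looks only at inputs and outputs, so two solutions with different internals both pass, while a solution with the wrong behavior fails.

```python
from typing import Callable, Iterable, List

# Hypothetical task: return the unique items of a sequence,
# preserving the order in which each item first appears.


def dedupe_with_set(items: Iterable[str]) -> List[str]:
    """Solution A: track seen items in a set."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_with_dict(items: Iterable[str]) -> List[str]:
    """Solution B: rely on dict insertion order (Python 3.7+)."""
    return list(dict.fromkeys(items))


def broken_dedupe(items: Iterable[str]) -> List[str]:
    """Incorrect solution: removes duplicates but loses first-seen order."""
    return sorted(set(items))


def verify_dedupe(candidate: Callable[[Iterable[str]], List[str]]) -> bool:
    """Check observable behavior only; never inspect how candidate works."""
    cases = [
        ([], []),
        (["a"], ["a"]),
        (["b", "a", "b", "c", "a"], ["b", "a", "c"]),
    ]
    return all(candidate(list(given)) == expected for given, expected in cases)


print(verify_dedupe(dedupe_with_set))   # True
print(verify_dedupe(dedupe_with_dict))  # True
print(verify_dedupe(broken_dedupe))     # False
```

A verifier tied to implementation details, for example one asserting that a particular helper is called, would reject one of the two correct solutions. Checking behavior avoids that failure mode.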

Task Examples

113 total tasks

Abort pending body reads on shutdown

Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.

capricorn86/happy-dom TypeScript

Fix PromQL label sorting across typed and untyped values

PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.

prometheus/prometheus Go

Add config file parsing to Cliffy commands

Add command-level config file loading, parsing, merging, and precedence handling.

c4spar/cliffy TypeScript

Add deterministic map conflict detection to Y.Map writes

Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.

yjs/yjs JavaScript

Add trap coredump generation to wasmi

Generate opt-in Wasm coredumps on traps and attach the bytes to errors.

wasmi-labs/wasmi Rust

Add XML diff, patch, and merge operations to etree

Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.

beevik/etree Go

Read the Full Blog

01 Introduction: Why a new benchmark

02 Overview: What separates DeepSWE

03 Methodology: How tasks and verifiers are built

04 Results: Where frontier models diverge

05 Qualitative Analysis: How each frontier model fails

06 Limitations & Future Work: What we'd do differently

Ready to benchmark your agent?

Run DeepSWE against your frontier coding model and see how it compares on original, long-horizon engineering tasks.
