Long-Horizon SWE Benchmark

Measuring frontier coding agents on original, long-horizon engineering tasks

DeepSWE is a contamination-free benchmark built to separate today's leading coding agents. Tasks are written from scratch across 91 repositories and 5 languages, with hand-written verifiers that test software behavior rather than implementation details.

91 Repositories
5 Languages
113 Tasks
5.5× More Code per Task (vs. SWE-bench Pro)

Why DeepSWE

Contamination-Free

Tasks are written from scratch rather than adapted from existing commits or PRs, so no model can have seen the solutions during pretraining.

High Diversity

Tasks span a broad pool of 91 repositories across 5 languages, covering real-world software engineering scenarios.

Real-World Complexity

Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5× more code and ~2× more output tokens.

Reliable Verification

Verifiers are hand-written to test software behavior rather than implementation details, ensuring accurate evaluation.

Leaderboard

Showing 12 of 16 models

#   Model                     Score
1   gpt-5.5 xhigh             70% ±4%
2   gpt-5.4 xhigh             56% ±5%
3   claude-opus-4.7 max       54% ±5%
4   claude-sonnet-4.6 high    32% ±4%
5   gemini-3.5-flash medium   28% ±4%
6   gpt-5.4-mini xhigh        24% ±4%
7   kimi-k2.6                 24% ±4%
8   mimo-v2.5-pro             19% ±4%
9   glm-5.1                   18% ±4%
10  gemini-3.1-pro            10% ±3%
11  deepseek-v4-pro           8% ±2%
12  gemini-3-flash            5% ±2%
All models are run with mini-swe-agent
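
The page does not say how the ± values are computed. One common reading, offered here only as an assumption, is a binomial standard error over the 113 tasks, with each task treated as one independent pass/fail trial. A minimal Python sketch under that assumption (the function name `pass_rate_stderr` is hypothetical and not part of DeepSWE):

```python
import math

N_TASKS = 113  # total number of tasks reported on this page


def pass_rate_stderr(pass_rate: float, n_tasks: int = N_TASKS) -> float:
    """Binomial standard error of a pass rate over n_tasks independent tasks.

    Assumption: each task is a single independent pass/fail trial. The
    benchmark's actual error-bar method is not stated in the source.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


if __name__ == "__main__":
    # Scores taken from the leaderboard rows for rank 1 and rank 12.
    for name, score in [("gpt-5.5 xhigh", 0.70), ("gemini-3-flash", 0.05)]:
        se = pass_rate_stderr(score)
        print(f"{name}: {score:.0%} with standard error {se:.1%}")
```

Under this assumption, a 70% score over 113 tasks has a standard error of a little over 4 percentage points, which is in line with the ±4% shown for the top row. Multiple attempts per task or a different interval method would give different values.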

How DeepSWE Works

01 Task Authorship

Each task is written from scratch by expert engineers. Nothing is adapted from existing commits or PRs, so the solutions cannot already appear in model training data.

02 Multi-Repository Sampling

Tasks are set in 91 diverse real-world repositories spanning TypeScript, Go, Rust, JavaScript, and Python.

03 Agent Execution

Frontier coding agents receive the task prompt and generate a solution. Prompts are concise, roughly half the length of SWE-bench Pro's, while solutions require 5.5× more code.

04 Behavioral Verification

Hand-written verifiers test whether the software behaves correctly, independent of implementation specifics, ensuring reliable scoring.
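
To illustrate what testing behavior rather than implementation can look like, here is a minimal Python sketch. It is not a DeepSWE verifier: the config-merging task, the precedence rule (CLI over file over defaults), and every function name below are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict

Config = Dict[str, Any]
MergeFn = Callable[[Config, Config, Config], Config]


def verify_precedence(merge: MergeFn) -> bool:
    """Hypothetical behavioral verifier: checks observable results only.

    It never inspects how `merge` is written (helper names, loops, classes),
    so any implementation with the right behavior passes.
    """
    defaults = {"port": 8080, "debug": False}
    file_cfg = {"port": 9000}
    cli_args = {"debug": True}

    result = merge(defaults, file_cfg, cli_args)

    checks = [
        result.get("port") == 9000,   # file value overrides the default
        result.get("debug") is True,  # CLI value overrides the default
        defaults == {"port": 8080, "debug": False},  # input is not mutated
    ]
    return all(checks)


# Two structurally different candidate solutions with the same behavior.
def merge_with_unpacking(defaults: Config, file_cfg: Config, cli: Config) -> Config:
    return {**defaults, **file_cfg, **cli}


def merge_with_loop(defaults: Config, file_cfg: Config, cli: Config) -> Config:
    merged: Config = {}
    for layer in (defaults, file_cfg, cli):
        for key, value in layer.items():
            merged[key] = value
    return merged


# A candidate with the wrong behavior: defaults incorrectly win.
def merge_wrong_order(defaults: Config, file_cfg: Config, cli: Config) -> Config:
    return {**cli, **file_cfg, **defaults}


if __name__ == "__main__":
    for candidate in (merge_with_unpacking, merge_with_loop, merge_wrong_order):
        print(candidate.__name__, verify_precedence(candidate))
```

Both correct candidates pass even though they share no structure, while the candidate with the wrong precedence fails. A verifier that instead asserted on a specific helper name or internal data layout would reject valid solutions.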

Task Examples

113 total tasks
Abort pending body reads on shutdown
Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.
capricorn86/happy-dom TypeScript
Fix PromQL label sorting across typed and untyped values
PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.
prometheus/prometheus Go
Add config file parsing to Cliffy commands
Add command-level config file loading, parsing, merging, and precedence handling.
c4spar/cliffy TypeScript
Add deterministic map conflict detection to Y.Map writes
Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.
yjs/yjs JavaScript
Add trap coredump generation to wasmi
Generate opt-in Wasm coredumps on traps and attach the bytes to errors.
wasmi-labs/wasmi Rust
Add XML diff, patch, and merge operations to etree
Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.
beevik/etree Go

Read the Full Blog

01 Introduction: Why a new benchmark
02 Overview: What separates DeepSWE
03 Methodology: How tasks and verifiers are built
04 Results: Where frontier models diverge
05 Qualitative Analysis: How each frontier model fails
06 Limitations & Future Work: What we'd do differently

Ready to benchmark your agent?

Run DeepSWE against your frontier coding model and see how it compares on original, long-horizon engineering tasks.
