Measuring frontier coding agents on original, long-horizon engineering tasks
DeepSWE is a contamination-free benchmark built to separate today's leading coding agents. Tasks are written from scratch across 91 repositories and 5 languages, with hand-written verifiers that test software behavior rather than implementation details.
Why DeepSWE
Contamination Free
Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High Diversity
Tasks span a broad pool of 91 repositories across 5 languages, covering real-world software engineering scenarios.
Real-World Complexity
Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5× more code and ~2× more output tokens.
Reliable Verification
Verifiers are hand-written to test software behavior rather than implementation details, ensuring accurate evaluation.
Leaderboard
12 / 16 models
How DeepSWE Works
Task Authorship
Each task is written from scratch by expert engineers. Because no task is adapted from an existing commit or PR, its solution cannot already appear in model training data.
Multi-Repository Sampling
Tasks are drawn from 91 diverse repositories across TypeScript, Go, Rust, JavaScript, and Python, covering real-world codebases.
Agent Execution
Frontier coding agents receive the task prompt and generate a solution. Prompts are concise while solutions demand significant code output.
Behavioral Verification
Hand-written verifiers test whether the software behaves correctly, independent of implementation specifics, ensuring reliable scoring.
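To illustrate what testing behavior rather than implementation can look like, here is a minimal sketch in Python. The slugify task, both solutions, and the verify function are hypothetical and invented for this example; they are not taken from DeepSWE's tasks or verifiers. The verifier only calls the solution and checks its outputs, so two very different implementations can both pass.

```python
import re
from typing import Callable


# Hypothetical solution A: a single regular-expression substitution.
def slugify_regex(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


# Hypothetical solution B: a manual character loop with no regex at all.
def slugify_loop(title: str) -> str:
    words, current = [], []
    for ch in title.lower():
        if ch.isascii() and ch.isalnum():
            current.append(ch)
        elif current:
            # A separator ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return "-".join(words)


# Hypothetical behavioral verifier: it checks input/output behavior only
# and never inspects how the solution is written.
def verify(slugify: Callable[[str], str]) -> bool:
    cases = {
        "Hello, World!": "hello-world",
        "  many   spaces  ": "many-spaces",
        "already-a-slug": "already-a-slug",
        "": "",
    }
    return all(slugify(text) == expected for text, expected in cases.items())
```

Passing either slugify_regex or slugify_loop to verify yields the same verdict, whereas a check that asserted, say, that the solution imports re would wrongly reject the loop-based version.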
Task Examples
113 total tasks
Read the Full Blog
Ready to benchmark your agent?
Run DeepSWE against your frontier coding model and see how it compares on original, long-horizon engineering tasks.
Run DeepSWE