Leaderboard
Compare model performance across GMP knowledge and task completion benchmarks.
| # | Model | Overall | GMP Knowledge | Task Completion | Avg Latency | Total Tokens | # Evals |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 95.5% | 100.0% | 90.9% | 32.6s | 69k | 39 |
| 2 | GPT-5.4 | 94.2% | 100.0% | 88.4% | 8.3s | 27k | 39 |
| 3 | Claude Sonnet 4.6 | 94.2% | 100.0% | 88.3% | 27.6s | 73k | 39 |
| 4 | Claude Haiku 4.5 | 93.0% | 100.0% | 86.1% | 10.3s | 56k | 39 |
| 5 | DeepSeek V4 Pro | 92.9% | 97.1% | 88.6% | 28.7s | 48k | 40 |
| 6 | DeepSeek-R1 | 92.7% | 97.1% | 88.2% | 19.8s | 33k | 39 |
| 7 | MiniMax M2.7 | 92.4% | 97.1% | 87.6% | 19.2s | 68k | 40 |
| 8 | Gemini 3.1 Pro | 91.2% | 97.1% | 85.2% | 22.8s | 54k | 39 |
| 9 | Qwen3.6 35B A3B | 91.1% | 94.3% | 87.9% | 12.0s | 83k | 40 |
| 10 | GPT-5.4 mini | 90.5% | 100.0% | 81.0% | 2.6s | 21k | 39 |
| 11 | Qwen3.6 27B | 90.4% | 94.3% | 86.5% | 33.6s | 83k | 40 |
| 12 | Mistral Large 3 675B | 90.1% | 100.0% | 80.2% | 10.2s | 30k | 39 |
| 13 | DeepSeek-V3.2 | 88.2% | 97.1% | 79.3% | 19.2s | 46k | 79 |
| 14 | Gemini 3 Flash | 87.5% | 100.0% | 75.0% | 3.9s | 21k | 39 |
| 15 | Mistral Small 2603 | 87.1% | 100.0% | 74.2% | 3.1s | 26k | 39 |
| 16 | GPT-5.4 nano | 86.1% | 94.3% | 77.8% | 2.8s | 23k | 39 |
| 17 | Gemma 4 26B A4B IT | 85.3% | 97.1% | 73.5% | 8.9s | 25k | 40 |
| 18 | Llama 4 Maverick | 84.1% | 100.0% | 68.3% | 15.2s | 25k | 39 |
| 19 | Gemma 4 31B IT | 83.7% | 97.1% | 70.3% | 24.1s | 39k | 40 |
| 20 | DeepSeek V4 Flash | 83.2% | 94.3% | 72.2% | 40.6s | 89k | 40 |
| 21 | Gemini 3.1 Flash-Lite | 78.7% | 97.1% | 60.3% | 2.4s | 17k | 39 |
| 22 | DeepSeek-R1-Distill-Qwen-32B | 77.8% | 100.0% | 55.6% | 45.6s | 33k | 39 |
| 23 | Llama 3.3 70B Instruct | 77.7% | 94.3% | 61.2% | 60.6s | 23k | 39 |
| 24 | Qwen3.5-397B-A17B | 77.4% | 65.7% | 89.2% | 30.8s | 81k | 39 |
| 25 | Llama 4 Scout | 75.0% | 97.1% | 52.8% | 4.5s | 23k | 39 |
| 26 | Qwen3.5-35B-A3B | 70.9% | 54.3% | 87.5% | 30.1s | 174k | 39 |
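
The page does not state how the Overall column is derived. The figures in the table are consistent with an unweighted mean of the GMP Knowledge and Task Completion scores, so the sketch below checks that reading against a few rows copied from the table. The formula, the tolerance, and the function names are assumptions made for illustration, not a documented scoring method.

```python
# Rows copied from the table above:
# (model, overall %, GMP knowledge %, task completion %).
ROWS = [
    ("Claude Opus 4.6", 95.5, 100.0, 90.9),
    ("GPT-5.4", 94.2, 100.0, 88.4),
    ("DeepSeek V4 Pro", 92.9, 97.1, 88.6),
    ("Qwen3.5-397B-A17B", 77.4, 65.7, 89.2),
    ("Qwen3.5-35B-A3B", 70.9, 54.3, 87.5),
]

# Assumption: every published figure is rounded to one decimal place, so the
# recomputed mean may differ from the published Overall by up to 0.1 points.
# The small epsilon absorbs floating-point error.
TOLERANCE = 0.1 + 1e-9


def mean_of_benchmarks(gmp_knowledge: float, task_completion: float) -> float:
    """Hypothetical overall score: unweighted mean of the two benchmarks."""
    return (gmp_knowledge + task_completion) / 2


def check_rows(rows):
    """Map each model to whether its Overall matches the mean within TOLERANCE."""
    return {
        model: abs(mean_of_benchmarks(gmp, task) - overall) <= TOLERANCE
        for model, overall, gmp, task in rows
    }


if __name__ == "__main__":
    for model, matches in check_rows(ROWS).items():
        status = "consistent" if matches else "inconsistent"
        print(f"{model}: {status} with an unweighted mean")
```

If this reading holds, a model with a low GMP Knowledge score can rank near the bottom despite a strong Task Completion score, as Qwen3.5-397B-A17B does (65.7% and 89.2%).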