GMP Bench

Leaderboard

Compare model performance across GMP knowledge and task completion benchmarks. Click a model name to view detailed results.

 #  Model                         Overall  GMP Knowledge  Task Completion  Avg Latency  Total Tokens  # Evals
 1  Claude Opus 4.6                 95.5%         100.0%            90.9%        32.6s           69k       39
 2  GPT-5.4                         94.2%         100.0%            88.4%         8.3s           27k       39
 3  Claude Sonnet 4.6               94.2%         100.0%            88.3%        27.6s           73k       39
 4  Claude Haiku 4.5                93.0%         100.0%            86.1%        10.3s           56k       39
 5  DeepSeek V4 Pro                 92.9%          97.1%            88.6%        28.7s           48k       40
 6  DeepSeek-R1                     92.7%          97.1%            88.2%        19.8s           33k       39
 7  MiniMax M2.7                    92.4%          97.1%            87.6%        19.2s           68k       40
 8  Gemini 3.1 Pro                  91.2%          97.1%            85.2%        22.8s           54k       39
 9  Qwen3.6 35B A3B                 91.1%          94.3%            87.9%        12.0s           83k       40
10  GPT-5.4 mini                    90.5%         100.0%            81.0%         2.6s           21k       39
11  Qwen3.6 27B                     90.4%          94.3%            86.5%        33.6s           83k       40
12  Mistral Large 3 675B            90.1%         100.0%            80.2%        10.2s           30k       39
13  DeepSeek-V3.2                   88.2%          97.1%            79.3%        19.2s           46k       79
14  Gemini 3 Flash                  87.5%         100.0%            75.0%         3.9s           21k       39
15  Mistral Small 2603              87.1%         100.0%            74.2%         3.1s           26k       39
16  GPT-5.4 nano                    86.1%          94.3%            77.8%         2.8s           23k       39
17  Gemma 4 26B A4B IT              85.3%          97.1%            73.5%         8.9s           25k       40
18  Llama 4 Maverick                84.1%         100.0%            68.3%        15.2s           25k       39
19  Gemma 4 31B IT                  83.7%          97.1%            70.3%        24.1s           39k       40
20  DeepSeek V4 Flash               83.2%          94.3%            72.2%        40.6s           89k       40
21  Gemini 3.1 Flash-Lite           78.7%          97.1%            60.3%         2.4s           17k       39
22  DeepSeek-R1-Distill-Qwen-32B    77.8%         100.0%            55.6%        45.6s           33k       39
23  Llama 3.3 70B Instruct          77.7%          94.3%            61.2%        60.6s           23k       39
24  Qwen3.5-397B-A17B               77.4%          65.7%            89.2%        30.8s           81k       39
25  Llama 4 Scout                   75.0%          97.1%            52.8%         4.5s           23k       39
26  Qwen3.5-35B-A3B                 70.9%          54.3%            87.5%        30.1s          174k       39
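
The leaderboard does not say how the Overall score is derived. As a rough sanity check, the Python sketch below tests one assumption: that Overall is the unweighted mean of the GMP Knowledge and Task Completion scores. The function names `mean_score` and `max_deviation` are hypothetical helpers introduced for illustration, and the rows are a sample copied from the table. Because every published figure is rounded to one decimal place, a gap of up to about 0.1 percentage points is still consistent with the assumption.

```python
# Sample rows copied from the leaderboard:
# (model, overall %, GMP knowledge %, task completion %)
ROWS = [
    ("Claude Opus 4.6", 95.5, 100.0, 90.9),
    ("GPT-5.4", 94.2, 100.0, 88.4),
    ("DeepSeek-R1", 92.7, 97.1, 88.2),
    ("Llama 4 Maverick", 84.1, 100.0, 68.3),
    ("Qwen3.5-397B-A17B", 77.4, 65.7, 89.2),
    ("Qwen3.5-35B-A3B", 70.9, 54.3, 87.5),
]


def mean_score(knowledge, task):
    """Unweighted mean of the two benchmark scores (an assumed formula)."""
    return (knowledge + task) / 2.0


def max_deviation(rows):
    """Largest absolute gap between the listed Overall and the assumed mean."""
    return max(abs(overall - mean_score(k, t)) for _, overall, k, t in rows)


if __name__ == "__main__":
    for name, overall, k, t in ROWS:
        print(f"{name:<20} listed={overall:5.1f}  mean={mean_score(k, t):6.2f}")
    print(f"max deviation: {max_deviation(ROWS):.2f} percentage points")
```

If the assumption holds, the ranking weights both benchmarks equally. That would explain why Qwen3.5-397B-A17B places 24th despite its 89.2% Task Completion score: its 65.7% GMP Knowledge score pulls the average down.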