Held-out coding leaderboard

Each model is ranked by its held-out score, which counts only challenges published after the model's release and which it therefore could not have trained on. Models that do not yet have enough held-out evidence are listed as provisional below. Safety and agentic ability are scored separately.
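The held-out rule can be sketched as a date filter followed by a pass rate. This is only an illustration: the `Challenge` record, the `held_out_score` function, the `min_evidence` threshold, and the sample data are all hypothetical. It also assumes that "published after" means strictly later than the release date and that the score is a plain fraction of solved challenges, neither of which the leaderboard confirms.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Challenge:
    # Hypothetical record; the field names are illustrative only.
    name: str
    published: date
    solved: bool


def held_out_score(
    challenges: Iterable[Challenge],
    model_release: date,
    min_evidence: int = 1,
) -> Optional[float]:
    """Pass rate over challenges published strictly after the model's release.

    Returns None when fewer than `min_evidence` held-out challenges exist,
    which is one possible reading of a "provisional" model.
    """
    held_out = [c for c in challenges if c.published > model_release]
    if len(held_out) < min_evidence:
        return None
    return sum(c.solved for c in held_out) / len(held_out)


# Made-up sample data, not taken from the leaderboard.
sample = [
    Challenge("a", date(2024, 12, 1), True),   # predates release: excluded
    Challenge("b", date(2025, 2, 1), True),
    Challenge("c", date(2025, 3, 1), False),
    Challenge("d", date(2025, 4, 1), True),
]
score = held_out_score(sample, model_release=date(2025, 1, 1))
provisional = held_out_score(sample, model_release=date(2026, 1, 1))
```

With this sample, three challenges postdate the release and two of them are solved, so `score` is 2/3. A release date later than every challenge leaves no held-out evidence, so `provisional` is `None`.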

# Model Held-out All-corpus Math Agentic Planner Safety Calibration Self-repair Truncation Solved Efficiency Best run
1 qwen3-coder-next 0.891 (39 clean · 100% dated) 0.730 0.68 (22) 0.71 0.69 0.56 0.25 111/159 36 LOC 69 MB 0.3s UD-Q4_K_XL · 24 GB · runner verified
2 qwen3-coder 0.869 (39 clean · 100% dated) 0.720 0.36 (22) 0.93 0.77 0.30 0.10 102/159 41 LOC 67 MB 0.3s UD-Q4_K_XL · 24 GB · runner verified
3 phi-4-mini 0.319 (74 clean · 100% dated) 0.444 0.04 (22) 0.64 0.68 0.07 0.03 57/159 26 LOC 62 MB 1.7s Q6_K · 24 GB · runner verified

3 models.