Held-out coding leaderboard

Each model is ranked by its held-out score, which counts only challenges published after the model's release, so the model could not have trained on them. Models without enough held-out evidence yet are listed as provisional below. Safety and agentic ability are scored separately.
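As a rough illustration of the held-out idea, the sketch below keeps only results for challenges published after a model's release date and reports a pass rate over them. It rests on assumptions: the leaderboard's actual scoring formula and its cutoff for "provisional" are not stated here, so the plain pass rate, the `Result` type, the `held_out_score` function, and the `min_evidence` threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One model attempt at one challenge (hypothetical record shape)."""
    challenge_id: str
    published: date  # date the challenge was first published
    solved: bool     # whether the model's solution passed


def held_out_score(results, model_release, min_evidence=10):
    """Return (score, n_held_out); score is None when evidence is too thin.

    Assumption: the score is a plain pass rate over challenges published
    strictly after the model's release. min_evidence is a made-up cutoff.
    """
    held_out = [r for r in results if r.published > model_release]
    if len(held_out) < min_evidence:
        return None, len(held_out)  # would be treated as provisional
    solved = sum(r.solved for r in held_out)
    return solved / len(held_out), len(held_out)


# Usage with invented data: one challenge predates the release and is ignored.
results = [
    Result("a", date(2025, 1, 10), True),
    Result("b", date(2025, 6, 1), True),
    Result("c", date(2025, 7, 15), False),
    Result("d", date(2025, 8, 2), True),
]
score, n = held_out_score(results, date(2025, 5, 1), min_evidence=3)
print(n, round(score, 3))
```

Filtering on the publication date rather than on challenge content is what makes the guarantee simple: nothing published after the release date can have been in the training data, whatever the data sources were.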

#	Model	Held-out	All-corpus	Math	Agentic	Planner	Safety	Calibration	Self-repair	Truncation	Solved	Efficiency	Best run
1	qwen3-coder-next	0.891 (39 clean · 100% dated)	0.730	0.68 (22)	—	—	0.71	0.69	0.56	✂ 0.25	111/159	36 LOC · 69 MB · 0.3 s	UD-Q4_K_XL · 24 GB · runner verified
2	qwen3-coder	0.869 (39 clean · 100% dated)	0.720	0.36 (22)	—	—	0.93	0.77	0.30	✂ 0.10	102/159	41 LOC · 67 MB · 0.3 s	UD-Q4_K_XL · 24 GB · runner verified
3	phi-4-mini	0.319 (74 clean · 100% dated)	0.444	0.04 (22)	—	—	0.64	0.68	0.07	✂ 0.03	57/159	26 LOC · 62 MB · 1.7 s	Q6_K · 24 GB · runner verified

3 models.