Benchmarks
Data-driven performance comparisons for small language models fine-tuned with distil labs.
Benchmark
We Benchmarked 12 Small Language Models Across 8 Tasks to Find the Best Base Model for Fine-Tuning
Qwen3-4B ranks #1 overall. A fine-tuned 4B model matches or exceeds a 120B+ teacher on 7 of 8 benchmarks, and a well-tuned 1B outperforms a prompted 8B.
Classification · Question Answering
Read more →
Benchmark
Benchmarking the Platform
Distilled student models match or exceed the teacher LLM on 8 of 10 datasets spanning classification, NER, open-book QA, tool calling, and closed-book QA.
Classification · Tool Calling · Information Extraction · Question Answering
Read more →