Benchmarks
Data-driven performance comparisons for small language models fine-tuned with distil labs.
Benchmark
We Benchmarked 12 Small Language Models Across 8 Tasks to Find the Best Base Model for Fine-Tuning
Qwen3-4B ranks #1 overall. A fine-tuned 4B model matches or exceeds a 120B+ teacher on 7 of 8 benchmarks, and a well-tuned 1B outperforms a prompted 8B.
Classification · Question Answering
Read more →
Benchmark
Benchmarking the Platform
Distilled student models match or exceed the teacher LLM on 8 of 10 datasets spanning classification, NER, open-book QA, tool calling, and closed-book QA.
Classification · Tool Calling · Information Extraction · Question Answering
Read more →