We Benchmarked 12 Small Language Models Across 8 Tasks to Find the Best Base Model for Fine-Tuning
TL;DR
- Fine-tuned Qwen3-4B matches or exceeds GPT-OSS-120B (a 30x larger teacher) on 7 of 8 benchmarks, with a +19 point gain on SQuAD 2.0
- Qwen3 models consistently deliver the strongest fine-tuned results, with the 4B version ranking first overall
- Smaller models show greater improvement from fine-tuning than larger counterparts, making them viable for resource-constrained environments
Introduction
Selecting which small language model to fine-tune from a crowded landscape (Qwen, Llama, Gemma, Granite, SmolLM) is a practical challenge. We benchmarked 12 models across 8 tasks to answer:
- Which model produces the best fine-tuned results?
- Which is most tunable (gains the most from training)?
- Which has the best out-of-the-box base performance?
- Can a small fine-tuned student match a much larger teacher?
Models Evaluated
- Qwen3: 8B, 4B-Instruct-2507, 1.7B, 0.6B
- Llama: 3.1-8B-Instruct, 3.2-3B-Instruct, 3.2-1B-Instruct
- SmolLM2: 1.7B-Instruct, 135M-Instruct
- Gemma: 3-1b-it, 3-270m-it
- Granite: 3.3-8b-instruct
Training setup: 10,000 synthetic training examples per benchmark, 4 epochs, learning rate 5e-5, linear scheduler, LoRA rank 64. Identical hyperparameters across all models.
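The setup above can be sketched as a plain configuration dict, together with a helper that estimates how many trainable parameters LoRA adds at rank 64. The hidden size, layer count, and set of adapted projection matrices below are illustrative assumptions, not values from the benchmark.

```python
# Sketch of the shared training configuration used for every model.
TRAIN_CONFIG = {
    "train_examples": 10_000,   # synthetic examples per benchmark
    "epochs": 4,
    "learning_rate": 5e-5,
    "lr_scheduler": "linear",
    "lora_rank": 64,
}

def lora_trainable_params(n_layers: int, hidden: int, rank: int,
                          adapted_mats_per_layer: int = 4) -> int:
    """Each adapted weight W (hidden x hidden) gains two low-rank factors,
    A (hidden x r) and B (r x hidden): 2 * hidden * r extra parameters."""
    return n_layers * adapted_mats_per_layer * 2 * hidden * rank

# Example: a hypothetical 32-layer model with hidden size 4096.
added = lora_trainable_params(n_layers=32, hidden=4096,
                              rank=TRAIN_CONFIG["lora_rank"])
print(f"{added / 1e6:.1f}M trainable LoRA parameters")  # → 67.1M trainable LoRA parameters
```

Holding this configuration fixed across all 12 models is what makes the per-model rankings comparable.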
Q1: Best Fine-Tuned Performance
Winner: Qwen3-4B-Instruct-2507
| Model | Average Rank | 95% CI |
|---|---|---|
| Qwen3-4B-Instruct-2507 | 2.25 | +/- 1.03 |
| Qwen3-8B | 2.75 | +/- 1.37 |
| Llama-3.1-8B-Instruct | 4.00 | +/- 1.42 |
| Qwen3-1.7B | 4.44 | +/- 1.60 |
| Llama-3.2-3B-Instruct | 4.56 | +/- 1.73 |
| Qwen3-0.6B | 5.11 | +/- 1.86 |
Notably, the 4B variant outranks the larger 8B model: its more recent 2507 instruction-tuned release makes it the stronger distillation target.
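The rank tables can be reproduced with a simple aggregation: rank every model within each benchmark, then average those ranks and attach a 95% confidence interval. The per-benchmark ranks below are made up for illustration, and the normal approximation (1.96 × standard error) is an assumption; the post does not state which CI method was used.

```python
import statistics

def average_rank_with_ci(ranks: list[float]) -> tuple[float, float]:
    """Mean rank across benchmarks plus a 95% CI half-width
    (normal approximation: 1.96 * sample stdev / sqrt(n))."""
    mean = statistics.mean(ranks)
    half_width = 1.96 * statistics.stdev(ranks) / len(ranks) ** 0.5
    return mean, half_width

# Hypothetical per-benchmark ranks for one model across the 8 benchmarks.
ranks = [1, 3, 2, 2, 1, 4, 3, 2]
mean, ci = average_rank_with_ci(ranks)
print(f"average rank {mean:.2f} +/- {ci:.2f}")  # → average rank 2.25 +/- 0.72
```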
Q2: Most Tunable Models
Winner: Llama-3.2-1B-Instruct
| Model | Average Rank | 95% CI |
|---|---|---|
| Llama-3.2-1B-Instruct | 3.44 | +/- 1.31 |
| Llama-3.2-3B-Instruct | 4.67 | +/- 1.93 |
| Qwen3-0.6B | 4.78 | +/- 1.78 |
| SmolLM2-1.7B-Instruct | 5.00 | +/- 1.46 |
| gemma-3-270m-it | 5.00 | +/- 2.77 |
Smaller models like Llama-3.2-1B and Qwen3-0.6B show the largest gains from fine-tuning.
Q3: Best Base Performance
Winner: Qwen3-8B
| Model | Average Rank | 95% CI |
|---|---|---|
| Qwen3-8B | 1.75 | +/- 0.72 |
| granite-3.3-8b-instruct | 2.57 | +/- 0.84 |
| Qwen3-4B-Instruct-2507 | 3.75 | +/- 1.27 |
| Llama-3.1-8B-Instruct | 4.14 | +/- 2.11 |
| Qwen3-1.7B | 4.78 | +/- 1.02 |
Base performance correlates with model size: the 8B variants dominate in zero-shot and few-shot settings.
Q4: Student vs Teacher
Qwen3-4B matches or exceeds the 120B teacher on 7 of 8 benchmarks:
| Benchmark | Teacher | Qwen3-4B Fine-Tuned | Qwen3-4B Base | Delta (FT − Teacher) |
|---|---|---|---|---|
| TREC | 0.89 | 0.93 | 0.51 | +0.04 |
| Banking77 | 0.92 | 0.89 | 0.87 | -0.03 |
| Docs | 0.82 | 0.84 | 0.64 | +0.02 |
| Ecommerce | 0.88 | 0.90 | 0.75 | +0.02 |
| HotpotQA | 0.93 | 0.93 | 0.88 | +0.00 |
| Mental Health | 0.81 | 0.82 | 0.78 | +0.01 |
| Roman Empire QA | 0.75 | 0.80 | 0.65 | +0.05 |
| SQuAD 2.0 | 0.52 | 0.71 | 0.26 | +0.19 |
The student surpasses the 120B teacher by 19 points on SQuAD 2.0 closed-book QA, demonstrating how effectively fine-tuning embeds domain knowledge directly into the model's weights.
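As a sanity check, the deltas and the win count can be recomputed directly from the teacher and fine-tuned scores:

```python
# Teacher (GPT-OSS-120B) vs fine-tuned Qwen3-4B scores, per benchmark.
teacher = {"TREC": 0.89, "Banking77": 0.92, "Docs": 0.82, "Ecommerce": 0.88,
           "HotpotQA": 0.93, "Mental Health": 0.81, "Roman Empire QA": 0.75,
           "SQuAD 2.0": 0.52}
student = {"TREC": 0.93, "Banking77": 0.89, "Docs": 0.84, "Ecommerce": 0.90,
           "HotpotQA": 0.93, "Mental Health": 0.82, "Roman Empire QA": 0.80,
           "SQuAD 2.0": 0.71}

# Delta = fine-tuned student minus teacher, rounded to avoid float noise.
deltas = {b: round(student[b] - teacher[b], 2) for b in teacher}
wins = sum(d >= 0 for d in deltas.values())
print(f"matches or exceeds the teacher on {wins} of {len(deltas)} benchmarks")
print(f"largest gain: SQuAD 2.0 at +{deltas['SQuAD 2.0']:.2f}")
```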
Practical Recommendations
| Constraint | Recommended Model | Rationale |
|---|---|---|
| Maximum accuracy | Qwen3-4B-Instruct-2507 | Best fine-tuned performance |
| Very limited compute (<2B) | Llama-3.2-1B or Qwen3-0.6B | Highest tunability gains |
| No fine-tuning possible | Qwen3-8B | Best zero-shot/few-shot |
| Edge deployment | Qwen3-0.6B | Good tunability, minimal size |
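The table above can be captured as a small lookup helper for deployment scripts; the constraint keys are my own naming, not part of the benchmark.

```python
# Hypothetical helper mapping a deployment constraint to the recommended
# model from the table above. Constraint keys are illustrative.
RECOMMENDATIONS = {
    "max_accuracy": "Qwen3-4B-Instruct-2507",
    "limited_compute": "Llama-3.2-1B-Instruct",  # or Qwen3-0.6B
    "no_fine_tuning": "Qwen3-8B",
    "edge": "Qwen3-0.6B",
}

def recommend(constraint: str) -> str:
    """Return the recommended base model for a named constraint."""
    try:
        return RECOMMENDATIONS[constraint]
    except KeyError:
        raise ValueError(f"unknown constraint: {constraint!r}") from None

print(recommend("max_accuracy"))  # → Qwen3-4B-Instruct-2507
```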
Key Conclusion
Fine-tuning matters more than base model choice. A well-tuned 1B model can outperform a prompted 8B model. With proper fine-tuning, significantly smaller models achieve competitive performance at reduced computational and infrastructure costs while remaining deployable on consumer hardware.
Resources
- Small Expert Agents from 10 Examples (data generation methodology)
- Platform benchmarks
- distil labs platform