We Benchmarked 12 Small Language Models Across 8 Tasks to Find the Best Base Model for Fine-Tuning
TL;DR
- Fine-tuned Qwen3-4B matches or exceeds GPT-OSS-120B (a 30x larger teacher) on 7 of 8 benchmarks, with a +19 point gain on SQuAD 2.0
- Qwen3 models consistently deliver the strongest fine-tuned results, with the 4B version ranking first overall
- Smaller models show greater improvement from fine-tuning than larger counterparts, making them viable for resource-constrained environments
Introduction
Selecting which small language model to fine-tune from a crowded landscape (Qwen, Llama, Gemma, Granite, SmolLM) is a practical challenge. We benchmarked 12 models across 8 tasks to answer:
- Which model produces the best fine-tuned results?
- Which is most tunable (gains the most from training)?
- Which has the best out-of-the-box base performance?
- Can a small fine-tuned student match a much larger teacher?
Models Evaluated
- Qwen3: 8B, 4B-Instruct-2507, 1.7B, 0.6B
- Llama: 3.1-8B-Instruct, 3.2-3B-Instruct, 3.2-1B-Instruct
- SmolLM2: 1.7B-Instruct, 135M-Instruct
- Gemma: 3-1b-it, 3-270m-it
- Granite: 3.3-8b-instruct
Training setup: 10,000 synthetic training examples per benchmark, 4 epochs, learning rate 5e-5, linear scheduler, LoRA rank 64. Identical hyperparameters across all models.
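The setup above can be sketched as a plain configuration dict, together with a helper that estimates how many trainable parameters LoRA adds at rank 64. The hidden size, layer count, and set of adapted projection matrices below are illustrative assumptions, not values from the benchmark.

```python
# Sketch of the shared training configuration used for every model.
TRAIN_CONFIG = {
    "train_examples": 10_000,   # synthetic examples per benchmark
    "epochs": 4,
    "learning_rate": 5e-5,
    "lr_scheduler": "linear",
    "lora_rank": 64,
}

def lora_trainable_params(n_layers: int, hidden: int, rank: int,
                          adapted_mats_per_layer: int = 4) -> int:
    """Each adapted weight W (hidden x hidden) gains two low-rank factors,
    A (hidden x r) and B (r x hidden): 2 * hidden * r extra parameters."""
    return n_layers * adapted_mats_per_layer * 2 * hidden * rank

# Example: a hypothetical 32-layer model with hidden size 4096.
added = lora_trainable_params(n_layers=32, hidden=4096,
                              rank=TRAIN_CONFIG["lora_rank"])
print(f"{added / 1e6:.1f}M trainable LoRA parameters")  # → 67.1M trainable LoRA parameters
```

Holding this configuration fixed across all 12 models is what makes the per-model rankings comparable.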
Q1: Best Fine-Tuned Performance
Winner: Qwen3-4B-Instruct-2507
| Model | Average Rank | 95% CI |
|---|---|---|
| Qwen3-4B-Instruct-2507 | 2.25 | +/- 1.03 |
| Qwen3-8B | 2.75 | +/- 1.37 |
| Llama-3.1-8B-Instruct | 4.00 | +/- 1.42 |
| Qwen3-1.7B | 4.44 | +/- 1.60 |
| Llama-3.2-3B-Instruct | 4.56 | +/- 1.73 |
| Qwen3-0.6B | 5.11 | +/- 1.86 |
Notably, the 4B variant outranks the larger 8B model: its more recent 2507 instruction-tuned release makes it the stronger distillation target.
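The rank tables can be reproduced with a simple aggregation: rank every model within each benchmark, then average those ranks and attach a 95% confidence interval. The per-benchmark ranks below are made up for illustration, and the normal approximation (1.96 × standard error) is an assumption; the post does not state which CI method was used.

```python
import statistics

def average_rank_with_ci(ranks: list[float]) -> tuple[float, float]:
    """Mean rank across benchmarks plus a 95% CI half-width
    (normal approximation: 1.96 * sample stdev / sqrt(n))."""
    mean = statistics.mean(ranks)
    half_width = 1.96 * statistics.stdev(ranks) / len(ranks) ** 0.5
    return mean, half_width

# Hypothetical per-benchmark ranks for one model across the 8 benchmarks.
ranks = [1, 3, 2, 2, 1, 4, 3, 2]
mean, ci = average_rank_with_ci(ranks)
print(f"average rank {mean:.2f} +/- {ci:.2f}")  # → average rank 2.25 +/- 0.72
```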
Q2: Most Tunable Models
Winner: Llama-3.2-1B-Instruct
| Model | Average Rank | 95% CI |
|---|---|---|
| Llama-3.2-1B-Instruct | 3.44 | +/- 1.31 |
| Llama-3.2-3B-Instruct | 4.67 | +/- 1.93 |
| Qwen3-0.6B | 4.78 | +/- 1.78 |
| SmolLM2-1.7B-Instruct | 5.00 | +/- 1.46 |
| gemma-3-270m-it | 5.00 | +/- 2.77 |
Smaller models like Llama-3.2-1B and Qwen3-0.6B show the largest gains from fine-tuning.
Q3: Best Base Performance
Winner: Qwen3-8B
| Model | Average Rank | 95% CI |
|---|---|---|
| Qwen3-8B | 1.75 | +/- 0.72 |
| granite-3.3-8b-instruct | 2.57 | +/- 0.84 |
| Qwen3-4B-Instruct-2507 | 3.75 | +/- 1.27 |
| Llama-3.1-8B-Instruct | 4.14 | +/- 2.11 |
| Qwen3-1.7B | 4.78 | +/- 1.02 |
Base performance correlates with model size: the 8B variants dominate in zero-shot and few-shot settings.
Q4: Student vs Teacher
Qwen3-4B matches or exceeds the 120B teacher on 7 of 8 benchmarks:
| Benchmark | Teacher | Qwen3-4B Fine-Tuned | Qwen3-4B Base | Delta (FT − Teacher) |
|---|---|---|---|---|
| TREC | 0.89 | 0.93 | 0.51 | +0.04 |
| Banking77 | 0.92 | 0.89 | 0.87 | -0.03 |
| Docs | 0.82 | 0.84 | 0.64 | +0.02 |
| Ecommerce | 0.88 | 0.90 | 0.75 | +0.02 |
| HotpotQA | 0.93 | 0.93 | 0.88 | +0.00 |
| Mental Health | 0.81 | 0.82 | 0.78 | +0.01 |
| Roman Empire QA | 0.75 | 0.80 | 0.65 | +0.05 |
| SQuAD 2.0 | 0.52 | 0.71 | 0.26 | +0.19 |
The student surpasses the 120B teacher by 19 points on SQuAD 2.0 closed-book QA, demonstrating how effectively fine-tuning embeds domain knowledge directly into the model's weights.
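As a sanity check, the deltas and the win count can be recomputed directly from the teacher and fine-tuned scores:

```python
# Teacher (GPT-OSS-120B) vs fine-tuned Qwen3-4B scores, per benchmark.
teacher = {"TREC": 0.89, "Banking77": 0.92, "Docs": 0.82, "Ecommerce": 0.88,
           "HotpotQA": 0.93, "Mental Health": 0.81, "Roman Empire QA": 0.75,
           "SQuAD 2.0": 0.52}
student = {"TREC": 0.93, "Banking77": 0.89, "Docs": 0.84, "Ecommerce": 0.90,
           "HotpotQA": 0.93, "Mental Health": 0.82, "Roman Empire QA": 0.80,
           "SQuAD 2.0": 0.71}

# Delta = fine-tuned student minus teacher, rounded to avoid float noise.
deltas = {b: round(student[b] - teacher[b], 2) for b in teacher}
wins = sum(d >= 0 for d in deltas.values())
print(f"matches or exceeds the teacher on {wins} of {len(deltas)} benchmarks")
print(f"largest gain: SQuAD 2.0 at +{deltas['SQuAD 2.0']:.2f}")
```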
Practical Recommendations
| Constraint | Recommended Model | Rationale |
|---|---|---|
| Maximum accuracy | Qwen3-4B-Instruct-2507 | Best fine-tuned performance |
| Very limited compute (<2B) | Llama-3.2-1B or Qwen3-0.6B | Highest tunability gains |
| No fine-tuning possible | Qwen3-8B | Best zero-shot/few-shot |
| Edge deployment | Qwen3-0.6B | Good tunability, minimal size |
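The table above can be captured as a small lookup helper for deployment scripts; the constraint keys are my own naming, not part of the benchmark.

```python
# Hypothetical helper mapping a deployment constraint to the recommended
# model from the table above. Constraint keys are illustrative.
RECOMMENDATIONS = {
    "max_accuracy": "Qwen3-4B-Instruct-2507",
    "limited_compute": "Llama-3.2-1B-Instruct",  # or Qwen3-0.6B
    "no_fine_tuning": "Qwen3-8B",
    "edge": "Qwen3-0.6B",
}

def recommend(constraint: str) -> str:
    """Return the recommended base model for a named constraint."""
    try:
        return RECOMMENDATIONS[constraint]
    except KeyError:
        raise ValueError(f"unknown constraint: {constraint!r}") from None

print(recommend("max_accuracy"))  # → Qwen3-4B-Instruct-2507
```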
Key Conclusion
Fine-tuning matters more than base model choice. A well-tuned 1B model can outperform a prompted 8B model. With proper fine-tuning, significantly smaller models achieve competitive performance at reduced computational and infrastructure costs while remaining deployable on consumer hardware.
Resources
- Small Expert Agents from 10 Examples (data generation methodology)
- Platform benchmarks
- distil labs platform