
We Benchmarked 12 Small Language Models Across 8 Tasks to Find the Best Base Model for Fine-Tuning

Qwen3-4B ranks #1 overall. Fine-tuned 4B matches or exceeds a 120B+ teacher on 7 of 8 benchmarks. A well-tuned 1B outperforms a prompted 8B.

TL;DR

  1. Fine-tuned Qwen3-4B matches or exceeds GPT-OSS-120B (a 30x larger teacher) on 7 of 8 benchmarks, with a +19 point gain on SQuAD 2.0
  2. Qwen3 models consistently deliver the strongest fine-tuned results, with the 4B version ranking first overall
  3. Smaller models show greater improvement from fine-tuning than larger counterparts, making them viable for resource-constrained environments

Introduction

Selecting which small language model to fine-tune from a crowded landscape (Qwen, Llama, Gemma, Granite, SmolLM) is a practical challenge. We benchmarked 12 models across 8 tasks to answer:

  • Which model produces the best fine-tuned results?
  • Which is most tunable (gains most from training)?
  • Which has the best base performance out of the box?
  • Can a small student model match a large teacher?

Models Evaluated

  • Qwen3: 8B, 4B-Instruct-2507, 1.7B, 0.6B
  • Llama: 3.1-8B-Instruct, 3.2-3B-Instruct, 3.2-1B-Instruct
  • SmolLM2: 1.7B-Instruct, 135M-Instruct
  • Gemma: 3-1b-it, 3-270m-it
  • Granite: 3.3-8b-instruct

Training setup: 10,000 synthetic training examples per benchmark, 4 epochs, learning rate 5e-5, linear scheduler, LoRA rank 64. Identical hyperparameters across all models.
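The stated hyperparameters could be expressed with Hugging Face `peft` and `transformers` roughly as below. This is a sketch, not the authors' actual code: the LoRA alpha, target modules, and batch size are not given in the post and are assumptions here.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Hyperparameters from the benchmark setup; unstated values are assumptions.
lora_config = LoraConfig(
    r=64,                                  # LoRA rank 64 (stated)
    lora_alpha=128,                        # assumption: alpha not stated in the post
    target_modules=["q_proj", "v_proj"],   # assumption: typical attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=4,                    # 4 epochs (stated)
    learning_rate=5e-5,                    # lr 5e-5 (stated)
    lr_scheduler_type="linear",            # linear scheduler (stated)
)
```

The same `lora_config` and `training_args` would be reused unchanged for every model, matching the "identical hyperparameters" constraint.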


Q1: Best Fine-Tuned Performance

Winner: Qwen3-4B-Instruct-2507

| Model | Average Rank | 95% CI |
|---|---|---|
| Qwen3-4B-Instruct-2507 | 2.25 | +/- 1.03 |
| Qwen3-8B | 2.75 | +/- 1.37 |
| Llama-3.1-8B-Instruct | 4.00 | +/- 1.42 |
| Qwen3-1.7B | 4.44 | +/- 1.60 |
| Llama-3.2-3B-Instruct | 4.56 | +/- 1.73 |
| Qwen3-0.6B | 5.11 | +/- 1.86 |

Notably, the Qwen3-4B variant outranks the larger Qwen3-8B: the more recent 2507 instruction-tuned release yields superior distillation results.
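As a sanity check on how the "average rank" and CI columns are produced, here is a minimal sketch: mean rank across benchmarks with a t-based 95% confidence half-width. The example ranks are illustrative, not the study's data, and the t critical value for 8 benchmarks (df = 7) is hardcoded.

```python
import statistics

def rank_ci(ranks, t_crit=2.365):
    """Mean rank across benchmarks with a t-based 95% half-width.

    t_crit = 2.365 assumes n = 8 benchmarks (df = 7).
    """
    n = len(ranks)
    mean = statistics.mean(ranks)
    half_width = t_crit * statistics.stdev(ranks) / n ** 0.5
    return mean, half_width

# Illustrative per-benchmark ranks for one model (not real data):
mean, half = rank_ci([1, 2, 3, 2, 1, 4, 2, 3])
print(f"average rank {mean:.2f} +/- {half:.2f}")  # -> average rank 2.25 +/- 0.87
```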


Q2: Most Tunable Models

Winner: Llama-3.2-1B-Instruct

| Model | Average Rank | 95% CI |
|---|---|---|
| Llama-3.2-1B-Instruct | 3.44 | +/- 1.31 |
| Llama-3.2-3B-Instruct | 4.67 | +/- 1.93 |
| Qwen3-0.6B | 4.78 | +/- 1.78 |
| SmolLM2-1.7B-Instruct | 5.00 | +/- 1.46 |
| gemma-3-270m-it | 5.00 | +/- 2.77 |

Smaller models like Llama-3.2-1B and Qwen3-0.6B show the largest gains from fine-tuning.
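"Tunability" here means the improvement from fine-tuning rather than the absolute score. A small sketch with made-up scores illustrates why a small model can top this ranking despite a weaker starting point:

```python
def avg_tuning_gain(base_scores, tuned_scores):
    """Average per-benchmark improvement (tuned minus base)."""
    deltas = [t - b for b, t in zip(base_scores, tuned_scores)]
    return sum(deltas) / len(deltas)

# Hypothetical scores: the small model starts lower but gains far more.
small_gain = avg_tuning_gain([0.40, 0.55, 0.50], [0.78, 0.80, 0.76])
large_gain = avg_tuning_gain([0.70, 0.75, 0.72], [0.82, 0.84, 0.80])
print(small_gain > large_gain)  # the 1B-class model ranks higher on tunability
```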


Q3: Best Base Performance

Winner: Qwen3-8B

| Model | Average Rank | 95% CI |
|---|---|---|
| Qwen3-8B | 1.75 | +/- 0.72 |
| granite-3.3-8b-instruct | 2.57 | +/- 0.84 |
| Qwen3-4B-Instruct-2507 | 3.75 | +/- 1.27 |
| Llama-3.1-8B-Instruct | 4.14 | +/- 2.11 |
| Qwen3-1.7B | 4.78 | +/- 1.02 |

Base performance correlates with model size: the 8B variants dominate zero-shot and few-shot scenarios.


Q4: Student vs Teacher

Qwen3-4B matches or exceeds the 120B+ teacher on 7 of 8 benchmarks:

| Benchmark | Teacher | Qwen3-4B Fine-tuned | Qwen3-4B Base | Delta (fine-tuned vs teacher) |
|---|---|---|---|---|
| TREC | 0.89 | 0.93 | 0.51 | +0.03 |
| Banking77 | 0.92 | 0.89 | 0.87 | -0.03 |
| Docs | 0.82 | 0.84 | 0.64 | +0.02 |
| Ecommerce | 0.88 | 0.90 | 0.75 | +0.03 |
| HotpotQA | 0.93 | 0.93 | 0.88 | +0.00 |
| Mental Health | 0.81 | 0.82 | 0.78 | +0.01 |
| Roman Empire QA | 0.75 | 0.80 | 0.65 | +0.05 |
| SQuAD 2.0 | 0.52 | 0.71 | 0.26 | +0.19 |

On closed-book SQuAD 2.0, the student surpasses the 120B+ teacher by 19 points, demonstrating how effectively fine-tuning embeds domain knowledge directly into the model's weights.
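The distillation recipe implied here (teacher generates answers, student fine-tunes on them) ultimately reduces to building chat-format SFT records from teacher outputs. A minimal sketch, with an invented helper name and example text rather than the actual dataset:

```python
import json

def to_sft_records(questions, teacher_answers):
    """Pair each prompt with the teacher's answer as one chat-format example."""
    return [
        {"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a},
        ]}
        for q, a in zip(questions, teacher_answers)
    ]

# Illustrative pair (not from the actual benchmark data):
records = to_sft_records(
    ["In what year did the Western Roman Empire fall?"],
    ["The Western Roman Empire fell in 476 AD."],
)
print(json.dumps(records[0]))
```

Repeating this over 10,000 teacher responses per benchmark yields a JSONL-style training set a standard SFT trainer can consume directly.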


Practical Recommendations

| Constraint | Recommended Model | Rationale |
|---|---|---|
| Maximum accuracy | Qwen3-4B-Instruct-2507 | Best fine-tuned performance |
| Very limited compute (<2B) | Llama-3.2-1B or Qwen3-0.6B | Highest tunability gains |
| No fine-tuning possible | Qwen3-8B | Best zero-shot/few-shot |
| Edge deployment | Qwen3-0.6B | Good tunability, minimal size |

Key Conclusion

Fine-tuning matters more than base model choice. A well-tuned 1B model can outperform a prompted 8B model. With proper fine-tuning, significantly smaller models achieve competitive performance at reduced computational and infrastructure costs while remaining deployable on consumer hardware.

