Benchmarking the Platform
Large language models changed the way we build AI products — but they are not suited for all applications and environments. In many industries, you can’t always process data in the cloud. You need models that stay private, run fast, and fit on your hardware — without compromising on accuracy.
We put this to the test: can distilled small language models (SLMs) consistently match the bar set by a strong teacher LLM across varied tasks?
The short answer is yes. With task-specific distillation, compact students routinely match and often surpass their teacher LLMs while being dramatically smaller and faster.
How It Works
distil labs turns a few-shot learning prompt into a production-ready small model: a strong teacher LLM generates in-domain examples, we rigorously filter and balance that synthetic data, then fine-tune a compact SLM to mirror the teacher’s behavior. The student is evaluated on a held-out dataset and packaged for your runtime (vLLM, Ollama, on-prem/edge) — yielding teacher-level performance in a tiny, fast, private model.
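To make the curation step concrete, here is a minimal sketch of filtering and balancing teacher-generated synthetic data. The example schema (`text`, `label`, `confidence` fields) and the helper name `filter_and_balance` are illustrative assumptions, not the platform's actual API:

```python
from collections import defaultdict
import random

def filter_and_balance(examples, min_confidence=0.9, per_label_cap=500, seed=0):
    """Keep high-confidence teacher labels, then downsample to balance classes.

    `examples` are dicts like {"text": ..., "label": ..., "confidence": ...};
    this schema is an illustrative assumption, not the platform's API.
    """
    by_label = defaultdict(list)
    for ex in examples:
        if ex["confidence"] >= min_confidence:  # drop low-confidence teacher outputs
            by_label[ex["label"]].append(ex)

    # Balance: cap every class at the size of the smallest class (or per_label_cap).
    cap = min(per_label_cap, min(len(items) for items in by_label.values()))
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        balanced.extend(rng.sample(items, cap))
    return balanced
```

In practice the platform's filtering is richer than a single confidence threshold, but the principle is the same: the student only ever sees synthetic examples that survive quality and balance checks.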
Benchmark Results
In our benchmarks, the teacher is a strong cloud LLM, the trained student is a compact SLM fine-tuned with our platform, the base student is the SLM prompted for the task with no fine-tuning, and the seed student is an SLM trained only on the initial labeled set.
| Dataset | Teacher | Trained Student | Base Student | Seed Student |
|---|---|---|---|---|
| E-commerce | 0.905 | 0.91 +/- 0.01 | NA | 0.88 |
| TREC | 0.85 | 0.92 +/- 0.005 | NA | 0.81 |
| Mental Health | 0.82 | 0.85 +/- 0.01 | NA | 0.80 |
| Banking77 | 0.91 | 0.895 +/- 0.02 | NA | 0.76 |
| PII Redaction | 0.85 +/- 0.02 | 0.87 +/- 0.01 | 0.54 +/- 0.03 | 0.73 +/- 0.01 |
| HotpotQA | 0.93 +/- 0.01 | 0.95 +/- 0.01 | 0.8 +/- 0.01 | 0.85 |
| Roman Empire QA | 0.98 +/- 0.01 | 0.99 +/- 0.01 | 0.91 +/- 0.02 | 0.93 +/- 0.01 |
| Pizza Tool Calling | 0.50 +/- 0.04 | 0.70 +/- 0.03 | 0.03 | 0.0 |
| Git Tool Calling | 0.81 +/- 0.03 | 0.95 +/- 0.01 | 0.0 | 0.21 +/- 0.01 |
| SQuAD 2.0 | 0.57 +/- 0.04 | 0.64 +/- 0.34 | 0.43 +/- 0.04 | NA |
Across all benchmarks, the distilled student consistently outperforms the seed and base baselines and reaches or exceeds the teacher on nine of the ten datasets. The only exception is Banking77, where the student trails the teacher by 1.5 percentage points, within the reported confidence interval.
The decisive gaps versus the base models show why the pipeline matters: specializing an SLM through fine-tuning (trained student) moves the needle far more than prompting alone (base student). You can only get so far with prompting; fine-tuning is what unlocks genuinely accurate small local agents.
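The head-to-head tally can be reproduced directly from the mean scores in the table above, counting a tie as reaching the teacher:

```python
# Mean scores copied from the benchmark table: (teacher, trained student).
results = {
    "E-commerce": (0.905, 0.91),
    "TREC": (0.85, 0.92),
    "Mental Health": (0.82, 0.85),
    "Banking77": (0.91, 0.895),
    "PII Redaction": (0.85, 0.87),
    "HotpotQA": (0.93, 0.95),
    "Roman Empire QA": (0.98, 0.99),
    "Pizza Tool Calling": (0.50, 0.70),
    "Git Tool Calling": (0.81, 0.95),
    "SQuAD 2.0": (0.57, 0.64),
}

reaches = [name for name, (t, s) in results.items() if s >= t]
trails = [name for name, (t, s) in results.items() if s < t]
print(len(reaches), trails)  # → 9 ['Banking77']
```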
Baselines
We compare four reference points under the same preprocessing, seed split, and evaluation harness:
- Teacher (reference). Llama3 70B evaluated with a fixed prompt and k-shot examples drawn only from the training split (never test)
- Trained Student (ours). Llama3 3B trained on seed + curated synthetic data generated by the teacher
- Seed Student. Llama3 3B trained only on the initial seed set — shows what you get without synthesizing additional data
- Base Student (lower bound). Llama3 3B prompted with the task description — shows prompting-only performance
All baselines are scored on the same held-out test split, and all trained students share the same hyperparameters.
Datasets & Metrics
We evaluated five families of tasks:
Classification
- TREC — Question classification into 8 categories (dataset)
- BANKING77 — Customer inquiry classification (dataset)
- E-commerce — Product description categorization (dataset)
- Mental Health — User comment classification across 4 categories (dataset)
Information Extraction
- PII Redaction — Maps customer/support texts to redacted versions removing sensitive personal data
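To illustrate the input-to-output format of this task (not the model we train, which learns redaction end to end from examples), a naive rule-based redactor for two common PII types might look like this; the regex patterns and placeholder tokens are assumptions for illustration:

```python
import re

# Hypothetical patterns for two PII types; the benchmark covers many more
# (names, addresses, account numbers) that rules alone cannot catch reliably.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Reach me at jane.doe@example.com or +1 555 123 4567."))
# → Reach me at [EMAIL] or [PHONE].
```

The gap between the base student (0.54) and the trained student (0.87) in the table reflects exactly the long tail of PII that such simple rules and bare prompting both miss.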
Open Book QA
- HotpotQA — Complex, multi-hop questions requiring reasoning across multiple documents (paper)
- Roman Empire QA — Litmus test dataset for model training (GitHub)
Tool Calling
- Pizza Tool Calling — Given a recipe and current state, suggest the next step using a tool call
- Git Tool Calling — Turn plain-English queries into git commands via function calls
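A sketch of what a single Git tool-calling example might look like, using a hypothetical tool definition in the common JSON-schema style; the actual benchmark schema may differ:

```python
import json

# Hypothetical tool definition; the benchmark's real schema is an assumption here.
GIT_TOOL = {
    "name": "run_git",
    "description": "Execute a git command in the current repository.",
    "parameters": {
        "type": "object",
        "properties": {
            "subcommand": {"type": "string", "description": "e.g. 'checkout'"},
            "args": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["subcommand"],
    },
}

# A plain-English query and the structured call the model should emit.
query = "switch to the main branch"
expected_call = {
    "name": "run_git",
    "arguments": {"subcommand": "checkout", "args": ["main"]},
}

print(json.dumps(expected_call))
```

Scoring a task like this typically checks the emitted call structurally (correct tool name, correct arguments) rather than as free text, which is why base students that produce almost-right prose still score near zero in the table.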
Closed Book QA
- SQuAD 2.0 — Question-answering without context, requiring the model to internalize knowledge during training (explorer)
Conclusion
If you care about shipping private, low-latency AI without conceding accuracy, small models can carry the load — provided you give them the right training signal. These benchmarks show that disciplined generation and curation reliably deliver teacher-level performance in compact students, unlocking deployment flexibility from on-prem racks to edge devices.
If you have a task in mind, start with a short description and a handful of examples — we’ll take it from there.