Benchmarking the Platform
Large language models changed the way we build AI products — but they are not suited for all applications and environments. In many industries, you can’t always process data in the cloud. You need models that stay private, run fast, and fit on your hardware — without compromising on accuracy.
We put this to the test: can distilled small language models (SLMs) consistently match the bar set by a strong teacher LLM across varied tasks?
The short answer is yes. With task-specific distillation, compact students routinely match and often surpass their teacher LLMs while being dramatically smaller and faster.
How It Works
distil labs turns a few-shot learning prompt into a production-ready small model: a strong teacher LLM generates in-domain examples, we rigorously filter and balance that synthetic data, then fine-tune a compact SLM to mirror the teacher’s behavior. The student is evaluated on a held-out dataset and packaged for your runtime (vLLM, Ollama, on-prem/edge) — yielding teacher-level performance in a tiny, fast, private model.
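To make the curation step concrete, here is a minimal sketch of filtering and balancing teacher-generated synthetic data. The example schema (`text`, `label`, `confidence` fields) and the helper name `filter_and_balance` are illustrative assumptions, not the platform's actual API:

```python
from collections import defaultdict
import random

def filter_and_balance(examples, min_confidence=0.9, per_label_cap=500, seed=0):
    """Keep high-confidence teacher labels, then downsample to balance classes.

    `examples` are dicts like {"text": ..., "label": ..., "confidence": ...};
    this schema is an illustrative assumption, not the platform's API.
    """
    by_label = defaultdict(list)
    for ex in examples:
        if ex["confidence"] >= min_confidence:  # drop low-confidence teacher outputs
            by_label[ex["label"]].append(ex)

    # Balance: cap every class at the size of the smallest class (or per_label_cap).
    cap = min(per_label_cap, min(len(items) for items in by_label.values()))
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        balanced.extend(rng.sample(items, cap))
    return balanced
```

In practice the platform's filtering is richer than a single confidence threshold, but the principle is the same: the student only ever sees synthetic examples that survive quality and balance checks.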
Benchmark Results
In our benchmarks, the teacher is a strong cloud LLM, the trained student is a compact SLM fine-tuned with our platform, the base student is the SLM prompted for the task with no fine-tuning, and the seed student is an SLM trained only on the initial labeled set.
| Dataset | Teacher | Trained Student | Base Student | Seed Student |
|---|---|---|---|---|
| E-commerce | 0.905 | 0.91 +/- 0.01 | NA | 0.88 |
| TREC | 0.85 | 0.92 +/- 0.005 | NA | 0.81 |
| Mental Health | 0.82 | 0.85 +/- 0.01 | NA | 0.80 |
| Banking77 | 0.91 | 0.895 +/- 0.02 | NA | 0.76 |
| PII Redaction | 0.85 +/- 0.02 | 0.87 +/- 0.01 | 0.54 +/- 0.03 | 0.73 +/- 0.01 |
| HotpotQA | 0.93 +/- 0.01 | 0.95 +/- 0.01 | 0.8 +/- 0.01 | 0.85 |
| Roman Empire QA | 0.98 +/- 0.01 | 0.99 +/- 0.01 | 0.91 +/- 0.02 | 0.93 +/- 0.01 |
| Pizza Tool Calling | 0.50 +/- 0.04 | 0.70 +/- 0.03 | 0.03 | 0.0 |
| Git Tool Calling | 0.81 +/- 0.03 | 0.95 +/- 0.01 | 0.0 | 0.21 +/- 0.01 |
| SQuAD 2.0 | 0.57 +/- 0.04 | 0.64 +/- 0.34 | 0.43 +/- 0.04 | NA |
Across all benchmarks, the distilled student consistently outperforms the seed and base baselines and reaches or exceeds the teacher on nine of the ten datasets. The only exception is Banking77, where the student trails the teacher by 1.5 percentage points, within the reported confidence interval.
The decisive gaps versus the base models show why the pipeline matters: specializing an SLM through fine-tuning (trained student) moves the needle far more than prompting alone (base student). You can only get so far with prompting; fine-tuning is what unlocks genuinely accurate small local agents.
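The head-to-head tally can be reproduced directly from the mean scores in the table above, counting a tie as reaching the teacher:

```python
# Mean scores copied from the benchmark table: (teacher, trained student).
results = {
    "E-commerce": (0.905, 0.91),
    "TREC": (0.85, 0.92),
    "Mental Health": (0.82, 0.85),
    "Banking77": (0.91, 0.895),
    "PII Redaction": (0.85, 0.87),
    "HotpotQA": (0.93, 0.95),
    "Roman Empire QA": (0.98, 0.99),
    "Pizza Tool Calling": (0.50, 0.70),
    "Git Tool Calling": (0.81, 0.95),
    "SQuAD 2.0": (0.57, 0.64),
}

reaches = [name for name, (t, s) in results.items() if s >= t]
trails = [name for name, (t, s) in results.items() if s < t]
print(len(reaches), trails)  # → 9 ['Banking77']
```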
Baselines
We compare four reference points under the same preprocessing, seed split, and evaluation harness:
- Teacher (reference). Llama3 70B evaluated with a fixed prompt and k-shot examples drawn only from the training split (never test)
- Trained Student (ours). Llama3 3B trained on seed + curated synthetic data generated by the teacher
- Seed Student. Llama3 3B trained only on the initial seed set — shows what you get without synthesizing additional data
- Base Student (lower bound). Llama3 3B prompted with the task description — shows prompting-only performance
All baselines are scored on the same held-out test split, and all trained students share the same hyperparameters.
Datasets & Metrics
We evaluated five families of tasks:
Classification
- TREC — Question classification into 8 categories (dataset)
- BANKING77 — Customer inquiry classification (dataset)
- E-commerce — Product description categorization (dataset)
- Mental Health — User comment classification across 4 categories (dataset)
Information Extraction
- PII Redaction — Maps customer/support texts to redacted versions removing sensitive personal data
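To illustrate the input-to-output format of this task (not the model we train, which learns redaction end to end from examples), a naive rule-based redactor for two common PII types might look like this; the regex patterns and placeholder tokens are assumptions for illustration:

```python
import re

# Hypothetical patterns for two PII types; the benchmark covers many more
# (names, addresses, account numbers) that rules alone cannot catch reliably.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Reach me at jane.doe@example.com or +1 555 123 4567."))
# → Reach me at [EMAIL] or [PHONE].
```

The gap between the base student (0.54) and the trained student (0.87) in the table reflects exactly the long tail of PII that such simple rules and bare prompting both miss.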
Open Book QA
- HotpotQA — Complex, multi-hop questions requiring reasoning across multiple documents (paper)
- Roman Empire QA — Litmus test dataset for model training (GitHub)
Tool Calling
- Pizza Tool Calling — Given a recipe and current state, suggest the next step using a tool call
- Git Tool Calling — Turn plain-English queries into git commands via function calls
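A sketch of what a single Git tool-calling example might look like, using a hypothetical tool definition in the common JSON-schema style; the actual benchmark schema may differ:

```python
import json

# Hypothetical tool definition; the benchmark's real schema is an assumption here.
GIT_TOOL = {
    "name": "run_git",
    "description": "Execute a git command in the current repository.",
    "parameters": {
        "type": "object",
        "properties": {
            "subcommand": {"type": "string", "description": "e.g. 'checkout'"},
            "args": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["subcommand"],
    },
}

# A plain-English query and the structured call the model should emit.
query = "switch to the main branch"
expected_call = {
    "name": "run_git",
    "arguments": {"subcommand": "checkout", "args": ["main"]},
}

print(json.dumps(expected_call))
```

Scoring a task like this typically checks the emitted call structurally (correct tool name, correct arguments) rather than as free text, which is why base students that produce almost-right prose still score near zero in the table.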
Closed Book QA
- SQuAD 2.0 — Question-answering without context, requiring the model to internalize knowledge during training (explorer)
Conclusion
If you care about shipping private, low-latency AI without conceding accuracy, small models can carry the load — provided you give them the right training signal. These benchmarks show that disciplined generation and curation reliably deliver teacher-level performance in compact students, unlocking deployment flexibility from on-prem racks to edge devices.
If you have a task in mind, start with a short description and a handful of examples — we’ll take it from there.