Agentic systems don't need an LLM everywhere. When to use SLMs vs LLMs, common use cases, and the decision framework.

Small Models, Big Wins: Using Custom SLMs in Agentic AI

TL;DR

Most agentic AI workflows consist of chains of narrow, repeatable tasks — using an LLM at every node adds unnecessary latency, cost, and energy use
Small language models under 10B parameters operate on standard hardware with minimal latency and improved data privacy
Fine-tuned SLMs can match (or beat) much larger LLMs because they stay within task boundaries
Modern techniques including synthetic data generation, self-improvement, and knowledge distillation reduce customization requirements
distil labs provides data-efficient distillation for classification, QA, function calling, and information extraction
Implement SLMs for specialized tasks, high-throughput applications, and privacy-sensitive scenarios; reserve LLMs for open-ended reasoning

Introduction

Large language models democratized AI feature development — prototyping by prompting an LLM became commonplace and replaced the time-consuming process of creating bespoke machine learning models. This spawned numerous “GPT-wrapper” startups creating agentic systems powered by LLMs.

However, this trajectory toward larger models may not align with actual requirements. We propose combining LLM development speed with machine learning efficiency — offering a more efficient and environmentally sustainable way of designing AI systems.

The Problem: LLMs Are Overkill for Most Agent Nodes

Advanced agentic AI applications decompose into a workflow of specialized, modular tasks: route an intent, extract fields, call a tool, log a result, etc.

Most agent tasks utilize only a very small scope of an LLM’s capabilities, while deploying LLMs at every node incurs penalties in latency, cost, and energy consumption. By contrast, specialized smaller models deliver faster, cheaper performance while enabling design patterns that help us keep our data secure.

The Alternative: Small Language Models

What is a “Small” Language Model?

Models that fit on common consumer hardware for inference. As of 2025, models below 10B parameters are classified as SLMs.

Available SLM families:

Llama 3 (1B, 3B, 8B)
Phi-3/Phi-4 (~3.8B, ~7B)
Qwen2 (0.5B, 1B, 7B)
SmolLM2 (135M, 360M, 1.7B), SmolLM3 3B
Gemma (270M, 1B, 2B, 4B, 7B, 9B)
Mistral (7B)
Granite (8B)

Are SLMs Good Enough?

Recent NVIDIA research suggests small language models have the right characteristics and sufficient performance to become the backbone of the next generation of agentic AI applications.

When fine-tuned for a single job (e.g., PII redaction, function selection, document classification), SLMs deliver LLM-level accuracy with far lower variance, latency, and cost.

Large models retain advantages in open-ended reasoning and broad synthesis — the mistake is assuming you need that at every node of an agent.

distil labs: SLM Fine-Tuning with Just a Prompt

distil labs pipeline

Three steps:

Write a prompt (optionally attach context data)
Iterate based on LLM feedback
Review results within minutes

Prototype with an LLM’s speed, then ship with SLM economics: sub-second latency, lower cost, stricter guardrails, and private-by-default compute.

Supported tasks: classification, open-book QA, closed-book QA, function calling, information extraction.

SLM vs LLM: When to Choose What

Decision factor	Prefer SLM when…	Prefer LLM when…
Task complexity	Narrow, well-defined tasks: extraction, tagging, routing, simple Q&A	Open-ended synthesis, multi-step reasoning, planning
Knowledge breadth	Domain-specific, templated prompts with clear intent	Broad, cross-domain queries with unclear intent
Hallucination tolerance	Strict, deterministic behavior (tool calling, classification)	Some creativity is fine; paired with retrieval/validation
Privacy / data residency	Sensitive data, must run on-device/on-prem or in strict VPC	Cloud processing is acceptable
Latency	Tight SLAs / near-real-time UX (< ~200 ms)	Seconds are acceptable
Cost per request	Ultra-low cost at scale, massive volumes	Higher budget per call acceptable
Customization	Model should learn new skills specific to your use case	Satisfied with model accuracy beyond RAG
Deployment footprint	Constrained hardware (CPU-only, small edge devices)	Cloud inference or large GPU clusters available
Energy / sustainability	Minimizing energy is key	Not the primary constraint

Common Use Cases for SLMs

Cybersecurity

Alert/log triage and SIEM/SOAR summary
Phishing & malware intent classification
DLP/PII detection at the edge

Financial Services

KYC/KYB document extraction
AML alert triage summaries
Transaction labeling & merchant normalization

Healthcare

Clinical intake structured field extraction
On-device dictation cleanup
Medical documentation information extraction

Insurance

Claim severity/line-of-business routing
Policy clause extraction and coverage checks
Fraud signal pre-screening

Conclusion

Agentic systems don’t need an LLM everywhere. Start with LLMs to explore capabilities, then distill the stable skills into SLMs for each node and keep LLMs only where broad reasoning is essential.

The result: faster, cheaper, greener, and more controllable AI — made practical by data-efficient SLM fine-tuning.

Sources & Further Reading

Belcak et al., “Small Language Models are the Future of Agentic AI”
Sudalairaj et al., “LAB: Large-Scale Alignment for ChatBots”
Wang et al., “Self-Instruct”
Bai et al., “Constitutional AI”
Li et al., “Self-Alignment with Instruction Backtranslation”
Hinton et al., “Distilling the Knowledge in a Neural Network”

Small Models, Big Wins: Using Custom SLMs in Agentic AI

Small Models, Big Wins: Using Custom SLMs in Agentic AI

TL;DR

Introduction

The Problem: LLMs Are Overkill for Most Agent Nodes

The Alternative: Small Language Models

What is a “Small” Language Model?

Are SLMs Good Enough?

distil labs: SLM Fine-Tuning with Just a Prompt

SLM vs LLM: When to Choose What

Common Use Cases for SLMs

Cybersecurity

Financial Services

Healthcare

Insurance

Conclusion

Sources & Further Reading

Keep Learning