
Small Models, Big Wins: Using Custom SLMs in Agentic AI

Agentic systems don't need an LLM everywhere. When to use SLMs vs LLMs, common use cases, and the decision framework.



TL;DR

  • Most agentic AI workflows consist of chains of narrow, repeatable tasks — using an LLM at every node adds unnecessary latency, cost, and energy use
  • Small language models (under 10B parameters) run on common consumer hardware, with low latency and better data privacy because data can stay local
  • Fine-tuned SLMs can match (or beat) much larger LLMs because they stay within task boundaries
  • Modern techniques such as synthetic data generation, self-improvement, and knowledge distillation reduce the data and effort needed to customize a model
  • distil labs provides data-efficient distillation for classification, QA, function calling, and information extraction
  • Implement SLMs for specialized tasks, high-throughput applications, and privacy-sensitive scenarios; reserve LLMs for open-ended reasoning

Introduction

Large language models democratized AI feature development — prototyping by prompting an LLM became commonplace and replaced the time-consuming process of creating bespoke machine learning models. This spawned numerous “GPT-wrapper” startups creating agentic systems powered by LLMs.

However, the trend toward ever-larger models may not match what these systems actually require. We propose combining the development speed of LLMs with the efficiency of purpose-built machine learning models, which offers a more efficient and environmentally sustainable way of designing AI systems.


The Problem: LLMs Are Overkill for Most Agent Nodes

Advanced agentic AI applications decompose into a workflow of specialized, modular tasks: route an intent, extract fields, call a tool, log a result, etc.

Most agent tasks use only a small fraction of an LLM’s capabilities, so deploying an LLM at every node incurs avoidable penalties in latency, cost, and energy consumption. By contrast, specialized smaller models are faster and cheaper to run, and they enable design patterns that help keep data secure.
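To make the decomposition concrete, here is a minimal sketch of a workflow in which each node is bound to its own model. Everything in it is hypothetical: the three model functions are plain-Python stand-ins for real SLM or LLM endpoints, and the names `Node`, `handle`, and `NODES` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is just a callable from text to text. In a real system each
# one would wrap an SLM or LLM endpoint; these stubs are hypothetical.
Model = Callable[[str], str]

def tiny_router(text: str) -> str:
    """Stand-in for a fine-tuned intent-routing SLM."""
    return "refund" if "refund" in text.lower() else "other"

def tiny_extractor(text: str) -> str:
    """Stand-in for an extraction SLM: returns the first '#<digits>' token."""
    for token in text.split():
        cleaned = token.strip(".,!?")
        if cleaned.startswith("#") and cleaned[1:].isdigit():
            return cleaned[1:]
    return ""

def big_llm(text: str) -> str:
    """Stand-in for a general-purpose LLM, used only for open-ended requests."""
    return f"[LLM draft reply for: {text}]"

@dataclass
class Node:
    name: str
    model: Model
    size: str  # "SLM" or "LLM", kept for bookkeeping only

def handle(request: str, nodes: Dict[str, Node]) -> Dict[str, str]:
    """Run one request through the workflow and record which model sizes ran."""
    used: List[str] = []
    intent = nodes["route"].model(request)
    used.append(nodes["route"].size)
    result = {"intent": intent}
    if intent == "refund":
        # Narrow, well-defined step: a small model is enough.
        result["order_id"] = nodes["extract"].model(request)
        used.append(nodes["extract"].size)
    else:
        # Open-ended step: escalate to the large model.
        result["reply"] = nodes["fallback"].model(request)
        used.append(nodes["fallback"].size)
    result["models_used"] = ",".join(used)
    return result

NODES = {
    "route": Node("route", tiny_router, "SLM"),
    "extract": Node("extract", tiny_extractor, "SLM"),
    "fallback": Node("fallback", big_llm, "LLM"),
}

print(handle("Please refund order #1234.", NODES))
```

In this sketch the refund request never touches the large model; only requests the router cannot place fall through to it.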


The Alternative: Small Language Models

What is a “Small” Language Model?

We call a language model “small” if it fits on common consumer hardware for inference. As of 2025, that generally means models below 10B parameters.

Available SLM families:

  • Llama 3 (1B, 3B, 8B)
  • Phi-3/Phi-4 (~3.8B, ~7B)
  • Qwen2 (0.5B, 1.5B, 7B)
  • SmolLM2 (135M, 360M, 1.7B), SmolLM3 3B
  • Gemma (270M, 1B, 2B, 4B, 7B, 9B)
  • Mistral (7B)
  • Granite (8B)
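A rough way to see why the sub-10B range fits consumer hardware is to estimate the memory needed for the weights alone: parameter count times bits per parameter. The sketch below does that arithmetic. It is an approximation that ignores the KV cache, activations, and runtime overhead, and the 16-bit and 4-bit precisions are assumed examples rather than settings taken from this article.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# Illustrative sizes from the small-model range discussed above.
for size in (1.0, 3.0, 8.0):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size:>4}B params: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

By this estimate an 8B model needs roughly 16 GB for weights at 16-bit precision and about 4 GB at 4-bit, which is within reach of a single consumer GPU or a laptop with enough RAM.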

Are SLMs Good Enough?

Recent NVIDIA research suggests small language models have the right characteristics and sufficient performance to become the backbone of the next generation of agentic AI applications.

When fine-tuned for a single job (e.g., PII redaction, function selection, document classification), SLMs can deliver LLM-level accuracy on that job with far lower variance, latency, and cost.

Large models retain advantages in open-ended reasoning and broad synthesis — the mistake is assuming you need that at every node of an agent.


distil labs: SLM Fine-Tuning with Just a Prompt

Figure: distil labs pipeline

Three steps:

  1. Write a prompt (optionally attach context data)
  2. Iterate based on LLM feedback
  3. Review results within minutes

Prototype with an LLM’s speed, then ship with SLM economics: sub-second latency, lower cost, stricter guardrails, and private-by-default compute.

Supported tasks: classification, open-book QA, closed-book QA, function calling, information extraction.
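The internals of the distil labs pipeline are not described here, so the following is not its implementation. It is a toy illustration of the general teacher-student idea behind distillation, under loud assumptions: `teacher_label` is a hypothetical stand-in for prompting a large LLM, and a word-count classifier stands in for the small model. The point is only that the student learns from teacher-produced labels rather than human-annotated data.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List

def teacher_label(text: str) -> str:
    """Hypothetical stand-in for asking a large LLM to label one example."""
    billing_terms = ("invoice", "charge", "refund")
    return "billing" if any(t in text.lower() for t in billing_terms) else "technical"

def train_student(texts: List[str], teacher: Callable[[str], str]) -> Dict[str, Counter]:
    """Fit per-label word counts using only labels produced by the teacher."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text in texts:
        counts[teacher(text)].update(text.lower().split())
    return counts

def student_predict(counts: Dict[str, Counter], text: str) -> str:
    """Score each label by how often it has seen the words in the input."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

# Unlabeled examples: no human annotation is involved at any point.
unlabeled = [
    "I was charged twice on my invoice",
    "please refund the duplicate charge",
    "the app crashes on startup",
    "login page shows an error after update",
]
student = train_student(unlabeled, teacher_label)

held_out = ["my invoice has a duplicate charge", "app shows an error on login"]
for text in held_out:
    print(f"{text!r}: student={student_predict(student, text)}, "
          f"teacher={teacher_label(text)}")
```

A real setup would replace the word counts with a fine-tuned SLM and would measure student-teacher agreement on a much larger held-out set.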


SLM vs LLM: When to Choose What

| Decision factor | Prefer SLM when… | Prefer LLM when… |
| --- | --- | --- |
| Task complexity | Narrow, well-defined tasks: extraction, tagging, routing, simple Q&A | Open-ended synthesis, multi-step reasoning, planning |
| Knowledge breadth | Domain-specific, templated prompts with clear intent | Broad, cross-domain queries with unclear intent |
| Hallucination tolerance | Strict, deterministic behavior (tool calling, classification) | Some creativity is fine; paired with retrieval/validation |
| Privacy / data residency | Sensitive data, must run on-device/on-prem or in strict VPC | Cloud processing is acceptable |
| Latency | Tight SLAs / near-real-time UX (< ~200 ms) | Seconds are acceptable |
| Cost per request | Ultra-low cost at scale, massive volumes | Higher budget per call acceptable |
| Customization | Model should learn new skills specific to your use case | Satisfied with model accuracy beyond RAG |
| Deployment footprint | Constrained hardware (CPU-only, small edge devices) | Cloud inference or large GPU clusters available |
| Energy / sustainability | Minimizing energy is key | Not the primary constraint |
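The table can be turned into a rough first-pass check. The sketch below encodes five of its rows as yes/no signals and recommends an SLM when enough of them apply. The `TaskProfile` fields, the equal weighting, and the default threshold of three are illustrative assumptions, not part of the framework itself; only the 200 ms latency figure comes from the table.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    narrow_task: bool           # extraction, tagging, routing, simple Q&A
    sensitive_data: bool        # must stay on-device, on-prem, or in a strict VPC
    latency_budget_ms: int      # acceptable end-to-end latency per request
    high_volume: bool           # cost per request dominates at scale
    constrained_hardware: bool  # CPU-only or small edge devices

def recommend(profile: TaskProfile, slm_votes_needed: int = 3) -> str:
    """Count how many table rows point toward an SLM (equal weights assumed)."""
    votes = sum([
        profile.narrow_task,
        profile.sensitive_data,
        profile.latency_budget_ms < 200,  # "< ~200 ms" row of the table
        profile.high_volume,
        profile.constrained_hardware,
    ])
    return "SLM" if votes >= slm_votes_needed else "LLM"

intent_routing = TaskProfile(
    narrow_task=True, sensitive_data=True, latency_budget_ms=150,
    high_volume=True, constrained_hardware=False,
)
research_assistant = TaskProfile(
    narrow_task=False, sensitive_data=False, latency_budget_ms=5000,
    high_volume=False, constrained_hardware=False,
)

print("intent routing:", recommend(intent_routing))
print("research assistant:", recommend(research_assistant))
```

In practice the factors are rarely equal: a hard privacy or hardware constraint can decide the question on its own, so treat the vote count as a starting point for discussion rather than a rule.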

Common Use Cases for SLMs

Cybersecurity

  • Alert/log triage and SIEM/SOAR summary
  • Phishing & malware intent classification
  • DLP/PII detection at the edge

Financial Services

  • KYC/KYB document extraction
  • AML alert triage summaries
  • Transaction labeling & merchant normalization

Healthcare

  • Clinical intake structured field extraction
  • On-device dictation cleanup
  • Medical documentation information extraction

Insurance

  • Claim severity/line-of-business routing
  • Policy clause extraction and coverage checks
  • Fraud signal pre-screening

Conclusion

Agentic systems don’t need an LLM everywhere. Start with LLMs to explore capabilities, then distill the stable skills into SLMs for each node and keep LLMs only where broad reasoning is essential.

The result: faster, cheaper, greener, and more controllable AI — made practical by data-efficient SLM fine-tuning.

