Glossary

What is inference?

Inference is the act of running a trained model to generate output from an input — the live operation you pay for and wait on every time an AI system responds.

Inference is what happens when you actually use a trained model: you give it an input and it computes an output. It is the counterpart to training. Training is the expensive, largely up-front process of building the model, repeated only when the model is retrained or fine-tuned; inference is the repeated, ongoing process of running it to serve real requests. Every time a chat assistant answers, a document is classified, or an agent calls a model, that is an inference call.

For a language model, inference works by processing the input tokens and then generating output one token at a time, each new token conditioned on everything so far. This token-by-token generation is why responses stream in and why longer outputs take longer and cost more. Inference can run on a hosted API, where the provider manages the hardware and you pay per token, or on your own infrastructure with open-weight models, where you manage the GPUs and the serving stack yourself.
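The token-by-token loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `make_toy_model`, `generate`, and the `<eos>` marker are hypothetical names, and the "model" simply replays a fixed reply where a real system would run a neural network on the full context at every step.

```python
from typing import Callable, Iterator, List

EOS = "<eos>"  # hypothetical end-of-sequence marker


def make_toy_model(prompt_len: int, reply: List[str]) -> Callable[[List[str]], str]:
    """Build a stand-in for a trained model (assumption: it replays `reply`).

    The returned function sees the whole context, just as a real language
    model would, and returns exactly one next token.
    """
    def next_token(context: List[str]) -> str:
        i = len(context) - prompt_len  # how many tokens were generated so far
        return reply[i] if i < len(reply) else EOS
    return next_token


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int) -> Iterator[str]:
    """Autoregressive decoding: one model call per output token."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # conditioned on everything so far
        if token == EOS:
            break
        context.append(token)        # the new token becomes part of the input
        yield token                  # streaming: emit as soon as it exists


if __name__ == "__main__":
    prompt = ["what", "is", "inference", "?"]
    model = make_toy_model(len(prompt), ["running", "a", "trained", "model"])
    for tok in generate(model, prompt, max_new_tokens=16):
        print(tok, end=" ", flush=True)
    print()
```

Because `generate` calls the model once per output token, a reply twice as long needs roughly twice as many model calls, which is the source of the latency and cost behaviour described above.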

In production, inference is where the operating cost and the user-facing latency live, so it's where a lot of applied-AI engineering concentrates. Techniques to manage it include caching repeated work, batching requests, routing easy tasks to smaller cheaper models and hard ones to larger models, streaming output so users see progress immediately, and right-sizing the model to the task rather than always reaching for the largest. Getting inference economics right is often the difference between a feature that's viable at scale and one that's too slow or too expensive to ship.
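Two of the techniques above, caching repeated work and routing by difficulty, can be sketched together. Everything here is an assumption for illustration: the model names, the word-count routing rule, and `call_model`, which stands in for a real inference call. Production routers usually rely on better difficulty signals than prompt length.

```python
from functools import lru_cache
from typing import List

# Hypothetical model identifiers, not real product names.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"

# Records every call that actually reaches a model (for demonstration only).
calls: List[str] = []


def pick_model(prompt: str, max_easy_words: int = 50) -> str:
    """Route short prompts to the cheaper model, long ones to the larger one.

    Word count is a crude proxy for task difficulty (an assumption).
    """
    return SMALL_MODEL if len(prompt.split()) <= max_easy_words else LARGE_MODEL


def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real inference call to a hosted API or local server."""
    calls.append(model)
    return f"[{model}] response to: {prompt[:30]}"


@lru_cache(maxsize=1024)
def cached_infer(prompt: str) -> str:
    """Serve identical prompts from memory instead of re-running the model."""
    return call_model(pick_model(prompt), prompt)


if __name__ == "__main__":
    cached_infer("Classify this ticket: printer offline")
    cached_infer("Classify this ticket: printer offline")  # served from cache
    print(f"model calls made: {len(calls)}")
```

An exact-match cache like this only helps when the same prompt recurs verbatim; it is shown because it is the simplest form of the idea, not because it suits every workload.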

Inference matters because it is the part of AI you pay for forever. Training cost is a largely fixed, up-front investment; inference cost recurs with every single use and grows with adoption, so a system that's cheap in a demo can become ruinously expensive at scale if inference isn't engineered. It is also where reliability and speed are felt by users. Understanding inference (its cost, its latency, and the levers that control both) is essential to building AI that's not just capable but economically and operationally sustainable.
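The way recurring cost scales with adoption is simple arithmetic, sketched below. The prices and traffic figures are made-up assumptions chosen for round numbers, not any provider's actual rates, and `monthly_inference_cost` is a hypothetical helper.

```python
def monthly_inference_cost(requests_per_day: int,
                           input_tokens: int,
                           output_tokens: int,
                           usd_per_million_input: float,
                           usd_per_million_output: float,
                           days: int = 30) -> float:
    """Estimate monthly spend under per-token pricing (all inputs assumed)."""
    per_request = (input_tokens * usd_per_million_input
                   + output_tokens * usd_per_million_output) / 1_000_000
    return per_request * requests_per_day * days


if __name__ == "__main__":
    # Hypothetical workload: 1,000 input and 500 output tokens per request,
    # priced at $1 and $4 per million tokens respectively.
    for label, volume in [("demo", 100), ("at scale", 1_000_000)]:
        cost = monthly_inference_cost(volume, 1_000, 500, 1.0, 4.0)
        print(f"{label}: ${cost:,.2f} per month")
```

Under these assumed numbers each request costs $0.003, so 100 requests a day comes to about $9 a month while a million requests a day comes to about $90,000. The per-request cost is unchanged; only the volume moved.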
