Inference
Also known as: Model Inference, AI Inference, Prediction
The process of running a trained AI model to generate predictions or outputs from new inputs—the 'using' phase as opposed to training.
During inference, a trained model takes new, unseen inputs and produces outputs: predictions, text, images, or decisions.
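Conceptually, a trained model is just a function whose parameters were fixed during training; inference applies that function to fresh data. A minimal sketch in NumPy (the weight values here are invented for illustration):

```python
import numpy as np

# A "trained" linear classifier is nothing more than fixed parameters
# (weights and bias) learned earlier. These values are made up.
weights = np.array([[0.8, -0.4], [-0.3, 0.9]])  # learned weights
bias = np.array([0.1, -0.1])                    # learned bias

def infer(x: np.ndarray) -> int:
    """Inference: apply the frozen, learned parameters to a new input."""
    logits = x @ weights + bias
    return int(np.argmax(logits))  # predicted class index

# A new, unseen input flows through the fixed model to produce an output.
print(infer(np.array([1.0, 2.0])))
```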
Training vs. Inference
| Training | Inference |
|---|---|
| Learns patterns from data | Applies learned patterns |
| Computationally expensive | Lightweight per request |
| Done once (or periodically) | Done continuously |
| Requires labeled data | Processes new inputs |
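This split shows up directly in most ML libraries: a `fit`-style call is the expensive, occasional training step, and `predict` is the cheap, repeatable inference step. A sketch using scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5_000, random_state=0)

# Training: done once, learns patterns from labeled data.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Inference: done continuously, applies the learned patterns to
# new inputs -- no labels needed, and far cheaper per call.
for new_input in X[:3]:
    print(model.predict(new_input.reshape(1, -1))[0])
```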
Infrastructure
Inference can run on:
- Cloud APIs: OpenAI, Anthropic, Google
- Edge devices: Phones, IoT, embedded systems
- On-premise: Private servers for data security
- Specialized hardware: GPUs, TPUs, inference chips
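As a concrete example of the cloud-API route, here is a minimal sketch using the OpenAI Python SDK; the model name and prompt are placeholders, and an `OPENAI_API_KEY` environment variable is assumed:

```python
# Hosted inference: the provider runs the model on its own hardware,
# and you exchange inputs and outputs over HTTP.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize inference in one line."}],
)
print(response.choices[0].message.content)
```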
Costs
While a single inference call is far cheaper than training, inference costs add up at scale. Common optimization techniques include:
- Model quantization (lower numeric precision, e.g. int8 instead of float32)
- Batching requests
- Caching common responses
- Smaller, distilled models
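Two of these techniques, caching and batching, can be sketched with nothing beyond the standard library; `run_model` below is a hypothetical stand-in for a real model call:

```python
from functools import lru_cache

def run_model(batch: tuple) -> list:
    """Hypothetical model call; assume batching amortizes per-call overhead."""
    return [f"output for {x}" for x in batch]

# Caching: repeated identical inputs skip the model entirely.
@lru_cache(maxsize=10_000)
def cached_infer(prompt: str) -> str:
    return run_model((prompt,))[0]

# Batching: group pending requests into a single model call.
def batched_infer(prompts: list) -> list:
    return run_model(tuple(prompts))

print(cached_infer("hello"))      # computed by the model
print(cached_infer("hello"))      # served from the cache
print(batched_infer(["a", "b"]))  # one call, two outputs
```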