Training Data
Also known as: Training Dataset, Training Set, Training Corpus
The dataset used to teach an AI model patterns and behaviors—the examples from which it learns.
Training data is the collection of examples used to teach an AI model. The quality, size, and composition of this data fundamentally shape what the model learns.
Components
- Inputs: Raw data (text, images, audio)
- Labels: Desired outputs for supervised learning
- Metadata: Context about sources and collection
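As a minimal sketch of how these three components might sit together in a single record, assuming a hypothetical TrainingExample dataclass (the class, field names, and sample values are illustrative and not taken from any particular library or dataset):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingExample:
    """One training record (hypothetical structure, for illustration only)."""
    input_text: str                               # Inputs: the raw data the model sees
    label: Optional[str] = None                   # Labels: desired output; None if unlabeled
    metadata: dict = field(default_factory=dict)  # Metadata: source and collection context


# Made-up sample records; the source name and dates are placeholders.
examples = [
    TrainingExample("The movie was wonderful.", "positive",
                    {"source": "example-reviews", "collected": "2023-01-15"}),
    TrainingExample("Terrible service.", "negative",
                    {"source": "example-reviews", "collected": "2023-02-02"}),
    TrainingExample("An unlabeled sentence."),
]

# Supervised learning uses only the records that carry a label.
labeled = [ex for ex in examples if ex.label is not None]
```

Keeping the label optional lets the same structure hold both supervised and unlabeled examples, while the metadata travels with each record for later auditing.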
Scale
Modern foundation models train on massive datasets:
- GPT-4: Not officially disclosed; outside estimates run to trillions of tokens
- Llama 2: 2 trillion tokens
- Image models: Billions of image-text pairs (LAION-5B, for example, contains about 5.85 billion)
Quality Factors
- Diversity: Representing varied inputs
- Accuracy: Correct labels and clean data
- Recency: Up-to-date information
- Balance: Classes and groups represented in proportions that suit the task
- Consent: Ethical data collection
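Two of these factors, balance and clean data, lend themselves to simple automated checks. The sketch below is one rough way to measure them, assuming a small in-memory dataset; the helper names label_balance and duplicate_count are hypothetical, and real pipelines typically use more robust near-duplicate detection than exact matching.

```python
from collections import Counter


def label_balance(labels):
    """Return each label's share of the dataset as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def duplicate_count(texts):
    """Count texts that exactly repeat an earlier one after trimming and lowercasing."""
    normalized = [t.strip().lower() for t in texts]
    return len(normalized) - len(set(normalized))


# Made-up sample data for illustration.
texts = ["Great product", "great product ", "Awful", "Fine"]
labels = ["positive", "positive", "negative", "neutral"]

shares = label_balance(labels)   # fraction of examples per label
dupes = duplicate_count(texts)   # number of redundant entries
```

A heavily skewed result from label_balance or a high duplicate_count is a prompt to inspect the data further; neither number measures diversity, recency, or consent.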
Controversies
- Web scraping without permission
- Copyright infringement claims
- Personal data inclusion
- Bias amplification
- Lack of compensation for creators