AI & Generative Media

Training Data

Also known as: Training Dataset, Training Set, Training Corpus

The dataset used to teach an AI model patterns and behaviors—the examples from which it learns.

A model can only learn patterns that appear in its training examples, so the quality, size, and composition of this data fundamentally shape what the model learns.

Components

  • Inputs: Raw data (text, images, audio)
  • Labels: Desired outputs for supervised learning
  • Metadata: Context about sources and collection
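As an illustration of how the three components might sit together in one record, here is a minimal sketch. The class name TrainingExample, its field names, and the sample source name are hypothetical and chosen only for this example; real datasets use many different layouts.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingExample:
    """One record in a (hypothetical) text classification dataset."""
    input_text: str                               # Input: the raw data
    label: Optional[str] = None                   # Label: desired output, None if unlabeled
    metadata: dict = field(default_factory=dict)  # Metadata: source and collection context


examples = [
    TrainingExample(
        "The battery lasts all day.",
        "positive",
        {"source": "hypothetical-review-site", "collected": "2024-01-15"},
    ),
    TrainingExample(
        "Stopped working after a week.",
        "negative",
        {"source": "hypothetical-review-site", "collected": "2024-01-16"},
    ),
    # An unlabeled record: usable for unsupervised training, not supervised.
    TrainingExample("Arrived on Tuesday.", None, {"source": "hypothetical-review-site"}),
]

# Supervised learning can only use the records that carry a label.
labeled = [ex for ex in examples if ex.label is not None]
```

Keeping metadata alongside each record makes it possible to filter or audit the dataset later by source or collection date.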

Scale

Modern foundation models train on massive datasets:

  • GPT-4: Size not publicly disclosed; outside estimates run to trillions of tokens
  • Llama 2: 2 trillion tokens
  • Image models: Billions of image-text pairs (LAION-5B, for example, contains about 5.85 billion)
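Dataset sizes for language models are quoted in tokens. The sketch below shows the idea of counting tokens across a corpus, assuming whitespace splitting as a stand-in tokenizer. Production models use subword tokenizers, so their counts for the same text will differ; the function name count_tokens is hypothetical.

```python
def count_tokens(documents):
    """Total token count, using whitespace splitting as a simplifying assumption."""
    return sum(len(doc.split()) for doc in documents)


corpus = [
    "Training data shapes what a model learns.",
    "Larger corpora contain more tokens.",
]

total = count_tokens(corpus)
```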

Quality Factors

  • Diversity: Representing varied inputs
  • Accuracy: Correct labels and clean data
  • Recency: Up-to-date information
  • Balance: Proportional representation of classes, groups, and topics
  • Consent: Ethical data collection
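Some of these factors can be checked mechanically. The following is a rough sketch, not a complete audit: it measures label balance, counts exact duplicates after simple normalization, and counts missing labels. The function names and sample rows are hypothetical, and factors such as consent or recency need information beyond the text itself.

```python
from collections import Counter


def label_balance(labels):
    """Fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def exact_duplicates(texts):
    """Number of redundant copies after lowercasing and collapsing whitespace."""
    normalized = [" ".join(t.lower().split()) for t in texts]
    counts = Counter(normalized)
    return sum(c - 1 for c in counts.values() if c > 1)


def quality_report(rows):
    """rows is a list of (text, label) pairs; label may be None."""
    texts = [text for text, _ in rows]
    labels = [label for _, label in rows if label is not None]
    return {
        "balance": label_balance(labels),
        "duplicates": exact_duplicates(texts),
        "missing_labels": sum(1 for _, label in rows if label is None),
    }


rows = [
    ("Great product", "positive"),
    ("great   product", "positive"),  # duplicate once normalized
    ("Terrible", "negative"),
    ("Okay I guess", None),           # missing label
]

report = quality_report(rows)
```

A skewed balance or a high duplicate count does not prove the dataset is bad, but both are cheap signals that it deserves a closer look before training.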

Controversies

  • Web scraping without permission
  • Copyright infringement claims
  • Personal data inclusion
  • Bias amplification
  • Lack of compensation for creators
