Generative AI — Images, Video & Audio
How AI creates images, video, music and speech from scratch
What makes AI "generative"?
Traditional AI classifies or predicts from existing data — "is this a cat or dog?" Generative AI creates new data — "generate an image of a cat wearing a spacesuit."
The key insight: a model that learns the relationship between descriptions and images deeply enough can be used in the generative direction. Instead of labeling an existing image, you give it a description and ask it to produce a matching image.
This works because generative models learn the statistical patterns of what things look like, sound like, or read like — and can sample from those patterns to create new examples.
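As a deliberately tiny illustration of "learn the patterns, then sample from them", the sketch below fits a bell curve to a handful of numbers and draws new ones from it. Real generative models learn vastly richer distributions with neural networks; the training data and the function names `fit_gaussian` and `generate` are made up for this example.

```python
import random
import statistics

def fit_gaussian(samples):
    """'Training': summarize the data by its mean and standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n, seed=0):
    """'Generation': draw n new values from the learned distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]

# Hypothetical training data: eight heights in centimetres.
training_data = [158.0, 162.5, 171.0, 175.5, 168.0, 180.0, 165.5, 173.0]

mean, std = fit_gaussian(training_data)
new_samples = generate(mean, std, n=5)

# The new values are not copies of the training data, but they follow
# the same overall pattern (centred near the learned mean).
print(f"learned mean={mean:.2f}, std={std:.2f}")
print([round(s, 1) for s in new_samples])
```

The generated numbers are new, yet plausible given the data. Image, audio and video generators follow the same principle at a far larger scale.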
Diffusion models — how Stable Diffusion and Midjourney work
The dominant approach for image generation is diffusion. The intuition:
Training: Take millions of real images. Add random noise to each image progressively until it becomes pure static. Train a neural network to predict and remove the noise at each step.
Generation: Start with pure random noise. Apply the denoising network repeatedly, reversing the noising process: each step removes a little noise, with the text prompt steering what emerges. After roughly 20-50 steps, a coherent image appears.
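The noising step used in training can be sketched in a few lines. This is a minimal sketch that assumes a DDPM-style linear noise schedule (1000 steps, with noise levels from 1e-4 to 0.02); the schedule values, the three-number "image", and the function names are illustrative assumptions, not details from any specific model.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Blend the clean sample with Gaussian noise; return the result and the noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return xt, noise

def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between the network's guess and the noise actually added."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise)) / len(true_noise)

rng = random.Random(0)
alpha_bars = make_alpha_bars()
image = [0.5, -0.2, 0.8]  # hypothetical 3-pixel "image"

slightly_noisy, _ = add_noise(image, t=0, alpha_bars=alpha_bars, rng=rng)
pure_static, true_noise = add_noise(image, t=999, alpha_bars=alpha_bars, rng=rng)

print("signal left at first step:", alpha_bars[0])
print("signal left at last step: ", alpha_bars[-1])
```

During training, a network would receive the noisy sample and the step number, and its weights would be adjusted to make `noise_prediction_loss` small. The network itself is omitted here.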
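This is why generation takes multiple "steps": the model gradually sculpts an image out of noise, guided by your prompt. Stable Diffusion XL works this way, Flux.1 uses a closely related flow-based variant, and Midjourney v6 is generally assumed to use a similar approach, although its architecture is not publicly documented.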
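The generation loop described above can be sketched as follows. To keep it runnable without a trained network, the sketch makes a strong simplifying assumption: each prompt corresponds to exactly one known target "image", so the ideal noise prediction can be computed by formula. `PROMPT_TARGETS` and `toy_noise_predictor` are hypothetical stand-ins for the text encoder and the neural network, and the update rule is a DDIM-style deterministic step under an assumed linear noise schedule.

```python
import math
import random

# Hypothetical stand-in for prompt understanding: each prompt maps to a tiny 3-value "image".
PROMPT_TARGETS = {"bright": [0.9, 0.8, 0.9], "dark": [-0.9, -0.8, -0.9]}

def alpha_bar(t, num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal remaining after t noising steps (assumed linear schedule)."""
    product = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_train_steps - 1)
        product *= 1.0 - beta
    return product

def toy_noise_predictor(xt, ab, prompt):
    """Oracle replacing the trained network: exact noise when the data is one known target."""
    target = PROMPT_TARGETS[prompt]
    return [(x - math.sqrt(ab) * x0) / math.sqrt(1.0 - ab) for x, x0 in zip(xt, target)]

def sample(prompt, num_steps=30, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]  # start from pure noise
    # Evenly spaced timesteps from 1000 (all noise) down to 0 (clean).
    timesteps = [round(1000 * (num_steps - i) / num_steps) for i in range(num_steps + 1)]
    for t, t_prev in zip(timesteps, timesteps[1:]):
        ab, ab_prev = alpha_bar(t), alpha_bar(t_prev)
        eps = toy_noise_predictor(x, ab, prompt)
        # Estimate the clean image implied by the current noise prediction...
        x0_pred = [(xi - math.sqrt(1.0 - ab) * e) / math.sqrt(ab) for xi, e in zip(x, eps)]
        # ...then step to the slightly less noisy timestep.
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
             for x0, e in zip(x0_pred, eps)]
    return x

bright_image = sample("bright")
dark_image = sample("dark")
print([round(v, 3) for v in bright_image])
print([round(v, 3) for v in dark_image])
```

Both runs start from the same random noise (same seed) and end at different results because the prompt changes what the predictor removes at each step. A real model replaces the oracle with a network that has only learned an approximation of the noise, which is why many small steps are needed.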
Text-to-speech and voice cloning
Modern TTS (text-to-speech) uses neural networks trained on thousands of hours of human speech. The model learns to map text → audio patterns, including prosody (rhythm and emphasis), emotion, and speaker identity.
Voice cloning works by extracting a "voice embedding", a numerical fingerprint of a specific voice, from just a few seconds of audio. The TTS model then generates speech in that voice from any text.
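To make the idea of a "numerical fingerprint" concrete, here is a toy sketch in which a voice embedding is simply the average of per-frame feature vectors, and two voices are compared by cosine similarity. In real systems a neural speaker encoder produces the per-frame features; the three-number features below are invented for illustration, as are the function names.

```python
import math

def utterance_embedding(frame_features):
    """Average per-frame feature vectors into one fixed-length, unit-length vector."""
    dim = len(frame_features[0])
    mean = [sum(f[i] for f in frame_features) / len(frame_features) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean]

def cosine_similarity(a, b):
    """1.0 means same direction (same voice, in this toy); lower means less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical per-frame features from short clips of two speakers.
alice_clip_1 = [[0.9, 0.1, 0.3], [1.0, 0.2, 0.2], [0.8, 0.1, 0.4]]
alice_clip_2 = [[0.95, 0.15, 0.25], [0.85, 0.1, 0.35]]
bob_clip = [[0.1, 0.9, 0.7], [0.2, 1.0, 0.6]]

same_speaker = cosine_similarity(
    utterance_embedding(alice_clip_1), utterance_embedding(alice_clip_2))
different_speaker = cosine_similarity(
    utterance_embedding(alice_clip_1), utterance_embedding(bob_clip))

print(f"same speaker:      {same_speaker:.3f}")
print(f"different speaker: {different_speaker:.3f}")
```

Clips of different lengths yield embeddings of the same size, which is what lets a TTS model take the embedding as an extra input alongside the text.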
ElevenLabs, Kokoro, and Coqui TTS are all neural TTS systems of this kind, though they differ in how they represent and select voices. The open-source Kokoro model (82M parameters) is small enough to run in real time on a laptop CPU and still produces near-professional quality.
Video generation — the frontier
Video generation is dramatically harder than image generation because every frame must be consistent with every other frame — maintaining identity, physics, and lighting across time.
Modern video models like Wan 2.1, LTX Video, and HunyuanVideo extend the diffusion approach to 3D (width × height × time) or use transformer architectures that process video tokens similarly to text tokens.
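A little arithmetic shows why the extra time dimension is costly. The sketch below counts the tokens a transformer would see when a (time × height × width) grid is cut into patches. The grid and patch sizes are hypothetical round numbers, not the settings of any model named above.

```python
def num_patch_tokens(frames, height, width, patch_t=1, patch_h=2, patch_w=2):
    """Count tokens from cutting a (time, height, width) grid into equal patches."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("grid must divide evenly into patches")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

# Hypothetical 64 x 64 grid: one image versus a 16-frame clip.
image_tokens = num_patch_tokens(frames=1, height=64, width=64)   # 1 * 32 * 32 = 1024
video_tokens = num_patch_tokens(frames=16, height=64, width=64)  # 16 * 32 * 32 = 16384

# Full self-attention compares every token with every other token,
# so its cost grows with the square of the token count.
attention_cost_ratio = (video_tokens / image_tokens) ** 2        # 16 ** 2 = 256

print(image_tokens, video_tokens, attention_cost_ratio)
```

Sixteen frames mean sixteen times the tokens but, with full attention, 256 times the pairwise comparisons. This is one reason video models are far more expensive to train and run than image models.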
The "consistent character" problem — keeping a face identical across scenes — is one of the hardest open problems. Current solutions use IP-Adapter (face identity embeddings) or fine-tuning on a specific subject, but the results are still imperfect.
Sora, Runway Gen-4, and the open-source alternatives are all improving rapidly. If that pace holds, video quality could catch up with image quality within 1-2 years.