What Is Generative AI?

Over the past decade, deep learning has advanced rapidly, enabling machines to process and generate unstructured data such as text, images, and video. These advances—especially large language models (LLMs)—have fueled widespread excitement across media and industry. AI is increasingly seen as a transformative force with the potential to reshape businesses, societies, and individual work.

This chapter introduces generative models and explains, at a high level, how techniques like neural networks, large datasets, and computational scale enable models to generate human-like content. Rather than covering every modality in depth, the goal is to demystify how generative AI works and prepare readers to evaluate both its opportunities and challenges.

Let’s start with the terminology.

Introducing generative AI

Generative AI has gained significant attention due to its ability to produce text, images, and other creative content that often rivals human output. Beyond generation, these models support tasks such as semantic search, classification, and content manipulation—enabling automation and amplifying human creativity.

Generative AI refers to models that generate new content, in contrast to traditional predictive systems that only analyze existing data.

Benchmarks have been a major driver of progress. Figure 1 shows average LLM performance on the Massive Multitask Language Understanding (MMLU) benchmark, which measures knowledge and reasoning across domains such as mathematics, law, and computer science. The figure highlights rapid improvements in recent years, particularly across OpenAI’s publicly released models.

Figure 1: Average performance on the MMLU benchmark of LLMs

While these results should be interpreted cautiously—due to self-reporting and differences between zero-shot and five-shot evaluations—the gains between GPT-2, GPT-3, GPT-3.5, and GPT-4 are substantial. Five-shot prompting alone may account for roughly 20% of performance gains.

Key contributors to these improvements include:

  • Scaling model parameters (from billions to trillions)
  • Instruction tuning using human feedback
  • Architectural refinements and more diverse training data

Although some models now outperform average human raters on certain benchmarks, they generally fall short of expert-level performance and still struggle on tasks like grade-school math (GSM8K).

Generative Pre-trained Transformer (GPT) models—such as GPT-4—exemplify recent progress. ChatGPT’s widespread adoption demonstrates how larger models enable more capable conversational agents, supporting use cases from customer support to programming and creative writing.

These models increasingly function as generalist assistants. For example, GPT-4 performs well across subjects including mathematics, economics, biology, and law (including the Uniform Bar Exam), positioning LLMs as valuable tools for knowledge work.

OpenAI, founded in 2015, has been central to these developments. Originally a nonprofit, it transitioned to a capped for-profit model in 2019 and formed a close partnership with Microsoft. Key contributions include OpenAI Gym, the GPT model family, and DALL·E for text-to-image generation.

Generative models have also advanced computer vision, improving tasks such as object detection, segmentation, and image captioning.

What are generative models?

In popular media, artificial intelligence is often used as an umbrella term. In practice, it helps to distinguish between related concepts:

  • Artificial Intelligence (AI): Creating systems that reason, learn, and act autonomously
  • Machine Learning (ML): Algorithms that learn patterns from data
  • Deep Learning (DL): ML using deep neural networks
  • Generative Models: Models that create new data resembling training data
  • Language Models (LMs): Models that predict word sequences; large versions are LLMs

Unlike predictive models, generative models synthesize new data, enabling applications such as text, image, music, and video generation. They are also used to create synthetic training data when real data is scarce. For example, Microsoft’s phi-1 model was trained using AI-generated Python textbooks.

Common generative model categories

  • Text-to-text: GPT-4, Claude, LLaMA 2
  • Text-to-image: DALL·E 2, Stable Diffusion, Imagen
  • Text-to-audio: AudioLM, MusicGen
  • Text-to-video: Phenaki, Emu Video
  • Text-to-speech: WaveNet, Tacotron
  • Speech-to-text: Whisper
  • Image-to-text: CLIP (image–text matching), BLIP (captioning)
  • Image-to-image: Super-resolution, style transfer, inpainting
  • Text-to-code: Codex, AlphaCode

Some models are multimodal. GPT-4V, released in 2023, accepts both text and images, enabling OCR and safer image captioning by routing visual input through text-based filters.

Text is a common input modality that can be transformed into images, audio, video, or code. While this chapter focuses mainly on LLMs, text-to-image models are also discussed. Most of these systems rely on Transformer architectures trained via self-supervised learning.

Despite rapid progress, challenges remain, including data bias, evaluation difficulty, compute costs, misuse risks, and broader societal impacts.

Why now?

The public breakthrough of generative AI around 2022 resulted from several converging factors:

  1. Algorithmic progress
    Backpropagation, improved optimizers, and advances in neural architectures laid the foundation. Transformers (2017) introduced attention mechanisms that significantly improved scalability and performance.
  2. Hardware acceleration
    GPUs—particularly through NVIDIA’s CUDA platform—dramatically reduced training time. Figure 2 shows the long-term decline in storage costs, making large-scale training economically feasible.
  3. Model scaling
    Larger models better capture linguistic structure. As models reached billions of parameters, new capabilities emerged, including creative writing, code generation, and open-ended reasoning. Lower perplexity generally correlates with better task performance.
  4. Tooling and frameworks
    Libraries such as PyTorch, TensorFlow, and Keras streamlined experimentation and deployment.
  5. Data availability
    The explosion of internet data enabled training on unprecedented volumes of text and images.
  6. Benchmarks and community
    Shared benchmarks such as ImageNet and MMLU helped standardize evaluation and accelerate progress.

Earlier generative approaches—autoencoders, VAEs, and GANs—were critical stepping stones, but Transformers and transfer learning enabled today’s large-scale models.

Figure 2: Cost of computer storage since the 1950s in dollars (unadjusted) per terabyte

Understanding LLMs

Text generation models such as OpenAI’s GPT-4 can produce fluent, grammatically correct text across languages and formats. Their core purpose is to enable machines to understand and generate natural language, supporting tasks like content creation, translation, summarization, and text editing.

At a fundamental level, language modeling predicts the next word (or token) in a sequence based on prior context. Through this process, LLMs internalize grammar, syntax, and semantics, forming the backbone of modern NLP systems.
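The core operation can be sketched in a few lines: a model assigns a score (logit) to every token in its vocabulary, and a softmax turns those scores into probabilities. The vocabulary and logits below are made-up toy values, not output from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and hypothetical logits a model might assign
# after seeing the context "The cat sat on the".
vocab = ["mat", "dog", "moon", "chair"]
logits = [3.2, 0.5, -1.0, 1.8]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]
print(prediction)  # the most probable next token: "mat"
```

A real LLM does exactly this, but over a vocabulary of tens of thousands of tokens, with logits produced by a neural network rather than hand-picked.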

Central to this capability is representation learning—models learn internal representations directly from raw data rather than relying on hand-engineered features. Just as vision models learn edges and shapes from pixels, LLMs learn linguistic structure directly from text.

LLMs are widely used across domains, including:

  • Question answering: Powering chatbots and virtual assistants for customer support and task automation
  • Automatic summarization: Condensing long documents into concise summaries
  • Sentiment analysis: Extracting opinions and emotions from text
  • Topic modeling: Identifying latent themes across large document collections
  • Semantic search: Improving relevance by understanding meaning rather than keywords
  • Machine translation: Achieving near-commercial translation quality

Despite their strengths, LLMs still struggle with complex mathematical reasoning and logic. They generate the most probable response, which can lead to confident but incorrect outputs—commonly called hallucinations.

What is a GPT?

LLMs are deep neural networks based on the Transformer architecture and pre-trained on massive text corpora using self-supervised learning. Their ability to act as general-purpose “foundation models” enables reuse across many downstream tasks.

GPT models generate text autoregressively—predicting one token at a time—often producing responses indistinguishable from human writing. However, as OpenAI notes, they may sometimes generate plausible but incorrect answers.
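The autoregressive loop itself is simple; what makes GPT powerful is the model inside it. The sketch below replaces the neural network with a hypothetical bigram lookup table so the control flow is visible.

```python
# A toy autoregressive loop: at each step the "model" (here a
# hypothetical bigram table) scores candidate next tokens, and we
# append the most likely one (greedy decoding). Real GPT models do
# the same with a network scoring ~50K tokens at every step.
bigram_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        context = tokens[-1]                 # condition on prior context
        candidates = bigram_probs.get(context)
        if not candidates:                   # no known continuation
            break
        next_token = max(candidates, key=candidates.get)
        tokens.append(next_token)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

Sampling instead of always taking the argmax (e.g., with a temperature parameter) is what gives real models their variety.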

The Transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), relies on self-attention and feed-forward layers to model relationships between words regardless of distance. GPT models, introduced by OpenAI in 2018 (Improving Language Understanding by Generative Pre-Training), apply this architecture using a next-token prediction objective.

Training data and model scale have increased dramatically:

  • GPT-1 (2018): ~1B words
  • BERT (2018): ~3.3B words
  • Modern LLMs: trillions of tokens

This growth is illustrated in Figure 3, which shows model size, training cost, and organizational ownership across major LLMs.

Figure 3: LLMs from BERT to GPT-4 – size, training budget, and organizations. For the proprietary models, parameter sizes are often estimates.

GPT models have evolved into multimodal foundation models, supporting image input (GPT-4V) and enabling related technologies such as diffusion-based text-to-image systems.

GPT-3, trained on ~300B tokens with 175B parameters, marked a turning point. GPT-4’s exact size is undisclosed, though estimates range from hundreds of billions to over a trillion parameters, with training costs reportedly exceeding $100M.

ChatGPT, released in November 2022, demonstrated the power of instruction tuning and human feedback (RLHF), delivering strong multi-turn dialogue and safety improvements. GPT-4 further improved performance and alignment through extended fine-tuning.

Other LLMs

Several non-OpenAI models have advanced the field:

  • PaLM 2 (Google DeepMind): Smaller, more compute-efficient, with strong multilingual and reasoning performance
  • LLaMA & LLaMA 2 (Meta): Open models up to 70B parameters that catalyzed an explosion of open-source LLMs
  • Claude / Claude 2 (Anthropic): Strong GPT-4 competitor with a 100K-token context window, later extended to 200K
  • Mistral, Falcon, GPT-Neo, GPT-J: Open or semi-open alternatives enabling broader experimentation

While some open models approach closed-source systems on specific benchmarks, a significant performance gap remains at the very top end.

Major players

Training modern LLMs requires massive compute budgets—often $10M to $100M+—and specialized expertise. As shown in Figure 3, only a handful of organizations can train frontier models at scale, including OpenAI, Google, Meta, and Anthropic.

However, open-source releases like LLaMA have enabled smaller teams to achieve impressive results through fine-tuning and specialization, especially in coding and reasoning tasks.

How do GPT models work?

The Transformer architecture revolutionized NLP by enabling parallel processing and long-range dependency modeling. It uses an encoder–decoder structure with self-attention, feed-forward layers, residual connections, and normalization.

Figure 4 illustrates the Transformer architecture.

Figure 4: The Transformer architecture

Key architectural features include:

  • Positional encoding: Injects word order information
  • Layer normalization: Stabilizes and accelerates training
  • Multi-head attention: Captures different relational patterns in parallel

Transformers outperform recurrent models (e.g., LSTMs), especially on long contexts. Variants such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) improve inference speed and memory efficiency, enabling larger context windows.
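The heart of all these variants is scaled dot-product attention. A minimal NumPy sketch (single head, random toy inputs) shows the mechanism: every token produces a weighted mixture of all other tokens' values, with weights given by query-key similarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position attends to every position, weighted by
    query-key similarity, scaled by sqrt(d_k) for stability."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                              # 4 tokens, 8-dim vectors
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (4, 8): one updated vector per token
print(weights.sum(axis=-1))   # each attention row sums to 1
```

Multi-head attention simply runs several such operations in parallel on learned projections of Q, K, and V; MQA and GQA reduce cost by sharing K and V across heads.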

Pre-training

LLMs are trained in two stages:

  1. Unsupervised pre-training: Learning general language representations
  2. Fine-tuning / alignment: Adapting models for tasks and safety

Objectives include:

  • Masked Language Modeling (MLM) (e.g., BERT)
  • Next-token prediction (e.g., GPT)

Training minimizes Negative Log-Likelihood (NLL) and Perplexity (PPL). Lower perplexity indicates better predictive performance and less “surprise” at the next token.
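Perplexity is just the exponential of the average NLL. A toy calculation with made-up probabilities that a model assigned to four observed tokens:

```python
import math

# Hypothetical probabilities the model gave to the true next tokens:
token_probs = [0.5, 0.25, 0.8, 0.1]

nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)
print(round(ppl, 2))  # 3.16
```

A perplexity of about 3.16 means the model is, on average, as "surprised" as if it were choosing uniformly among roughly three equally likely tokens; a perfect model would reach a perplexity of 1.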

Tokenization

Before training, text is converted into tokens and mapped to numeric IDs:

  • Tokens may be words, subwords, or characters
  • Common tokenizers include BPE, WordPiece, and SentencePiece

Example:

“The quick brown fox jumps over the lazy dog!”
→ [“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “!”]

Tokenizers are fitted once and frozen. Vocabulary sizes typically range from 30K–50K tokens. The context window limits how many tokens a model can process at once.
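The word-level split above can be reproduced with a simple regular expression; this is only an illustration, since production tokenizers (BPE, WordPiece, SentencePiece) operate on subword units learned from data.

```python
import re

def word_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("The quick brown fox jumps over the lazy dog!")
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']

# Mapping tokens to numeric IDs via a tiny, illustrative vocabulary:
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
```

A subword tokenizer would instead break rare words into pieces (e.g., something like "jump" + "s"), keeping the vocabulary small while still covering arbitrary text.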

Scaling

As shown in Figure 3, LLMs have steadily grown larger. Scaling laws (Kaplan et al., 2020) show that performance improves predictably with:

  • Model size
  • Dataset size
  • Compute budget

Optimal performance requires scaling all three together. Later work (Chinchilla, 2022) showed that many models were undertrained, while more recent research (Textbooks Are All You Need, 2023) demonstrated that smaller models trained on high-quality data can be competitive.

Whether continued scaling will yield further breakthroughs remains an open question.
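The Chinchilla result is often summarized as a rule of thumb: train on roughly 20 tokens per parameter, with total training compute of approximately C ≈ 6·N·D FLOPs (N parameters, D tokens). The sketch below uses these approximate constants from the literature, not exact values.

```python
# Rough compute-optimal heuristic (Chinchilla rule of thumb):
# ~20 training tokens per parameter, C ≈ 6 * N * D FLOPs.
def chinchilla_optimal_tokens(n_params):
    return 20 * n_params

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

N = 70e9                              # a 70B-parameter model
D = chinchilla_optimal_tokens(N)      # ~1.4 trillion tokens
C = training_flops(N, D)
print(f"{D:.1e} tokens, {C:.1e} FLOPs")
```

By this heuristic, many earlier 100B+ models trained on only a few hundred billion tokens were indeed undertrained.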

Conditioning

After pre-training, models are adapted via:

  • Fine-tuning: Supervised learning and RLHF for helpfulness and safety
  • Prompting: Zero-shot or few-shot task conditioning via natural language instructions

These methods enable flexible task adaptation without retraining entire models.
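Few-shot prompting in particular involves no training code at all: worked examples are simply prepended to the query. The review texts below are invented for illustration.

```python
# Few-shot conditioning: prepend solved examples, then the new query.
# No weights change; the model infers the task from the pattern.
examples = [
    ("I loved this movie!", "positive"),
    ("Terrible service, never again.", "negative"),
]
query = "The food was amazing."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # ready to send to any instruction-following LLM
```

Dropping the examples turns this into a zero-shot prompt, which relies entirely on the instruction in the first line.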

How to try these models

Closed models (e.g., GPT-4) are accessible via APIs, while open-source LLMs can be run locally via platforms like Hugging Face. Users can fine-tune or deploy these models depending on their needs.

While this chapter focuses on LLMs due to their broad applicability, text-to-image and other generative models also play an important role. The next section explores text-conditioned image generation, its progress, and remaining challenges.

What are text-to-image models?

Text-to-image models are a class of generative AI systems that create realistic images from textual descriptions. They are widely used in creative industries for advertising, design, prototyping, fashion, and visual effects.

Key applications include:

  • Text-conditioned image generation: Creating images directly from prompts such as “a painting of a cat in a field of flowers.”
  • Image inpainting: Filling in missing or corrupted regions for restoration or object removal.
  • Image-to-image translation: Transforming images into new styles or domains (e.g., “make this photo look like a Monet painting”).
  • Image recognition: Using large vision foundation models for classification and object detection.

Popular systems such as Midjourney, DALL·E 2, and Stable Diffusion are trained on large datasets of image–text pairs. Most modern systems rely on diffusion models, which generate images by gradually denoising random noise into a coherent visual representation.

Model architectures

Two main generative approaches have been used for text-to-image synthesis:

Generative Adversarial Networks (GANs)

GANs consist of a generator, which produces images from noise and text embeddings, and a discriminator, which judges whether images are real or synthetic. Through this adversarial process, GANs can produce highly realistic images, but training is often unstable and computationally expensive.

Diffusion models

Diffusion models have become the dominant approach due to improved stability and image quality. They work in two stages:

  1. Forward diffusion: Noise is gradually added to an image until it becomes unrecognizable.
  2. Reverse diffusion: The model learns to iteratively remove noise, reconstructing an image guided by the text prompt.

This process is analogous to reversing the diffusion of ink in water. An example of this denoising trajectory is shown in Figure 5, which visualizes selected steps from a 40-step generation process using Stable Diffusion.
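The forward process has a convenient closed form: a sample at step t is a weighted blend of the clean image and Gaussian noise, governed by a cumulative schedule. The sketch below uses a toy 8×8 "image" and an illustrative linear noise schedule, not the schedule of any particular released model.

```python
import numpy as np

rng = np.random.default_rng(42)
x0 = rng.uniform(-1, 1, size=(8, 8))   # toy 8x8 "image"

T = 40
betas = np.linspace(1e-4, 0.3, T)      # illustrative noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative signal fraction

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) in closed form:
    sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

x_early, x_late = q_sample(x0, 1), q_sample(x0, T - 1)
print(alpha_bar[0], alpha_bar[-1])  # near 1 early, near 0 late
```

Early in the schedule the sample is almost the original image; by the final step it is almost pure noise, which is exactly the state reverse diffusion starts from.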

Figure 5: European-style castle in Japan, created using the Stable Diffusion V1-5 AI diffusion model

How diffusion-based text-to-image models work

Models such as DALL·E 2, Imagen, and Stable Diffusion use a text encoder to convert prompts into embeddings. These embeddings condition a cascade of diffusion models that progressively refine a latent image representation.

A typical Stable Diffusion pipeline works as follows:

  1. Generate a random latent noise tensor.
  2. Use a U-Net to predict noise conditioned on the text embedding.
  3. Subtract predicted noise from the latent image.
  4. Repeat denoising for a fixed number of steps (e.g., 40).
  5. Decode the final latent representation into pixel space using a Variational Autoencoder (VAE).
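The five steps above can be sketched as a loop. Note that `unet_predict_noise` and `decode_latent` are hypothetical placeholders standing in for the real conditioned U-Net and VAE decoder; the point is the control flow, not a working image generator.

```python
import numpy as np

rng = np.random.default_rng(0)
STEPS = 40

def unet_predict_noise(latent, text_embedding, t):
    # Placeholder for the conditioned U-Net noise prediction.
    return 0.1 * latent

def decode_latent(latent):
    # Placeholder for the VAE decoder (latent -> pixel space).
    return latent

text_embedding = rng.normal(size=(77, 768))  # CLIP-style embedding shape
latent = rng.normal(size=(4, 64, 64))        # 1. random latent noise tensor

for t in reversed(range(STEPS)):             # 4. repeat for fixed steps
    noise = unet_predict_noise(latent, text_embedding, t)  # 2. predict noise
    latent = latent - noise                  # 3. subtract predicted noise

image = decode_latent(latent)                # 5. decode to pixel space
print(image.shape)
```

In a real pipeline the subtraction in step 3 is replaced by a sampler update rule (e.g., DDIM or an Euler-style step), and classifier-free guidance blends conditional and unconditional noise predictions.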

Stable Diffusion

Stable Diffusion, developed by the CompVis group at LMU Munich, introduced latent diffusion models, significantly reducing training and inference costs compared to pixel-space diffusion. By operating in a lower-dimensional latent space, the model can run on consumer-grade GPUs while producing high-quality images.

Key components include:

  • VAE: Compresses images into latent space and decodes final outputs.
  • U-Net: Performs iterative denoising in latent space.
  • Loss function: Typically Mean Squared Error (MSE), encouraging accurate noise prediction.

The model was trained on the LAION-5B dataset, containing billions of image–text pairs scraped from sources such as Common Crawl, Flickr, and DeviantArt. Stable Diffusion’s code and weights are released under the CreativeML OpenRAIL-M license, enabling broad reuse and commercialization.

Conditioning and control

Conditioning allows text-to-image models to be guided by prompts or additional inputs such as depth maps or outlines. Text embeddings are processed by a transformer and injected into the noise prediction process, steering image generation toward the desired outcome.

Small changes in prompts, samplers, or random seeds can yield dramatically different images—highlighting both the creative power and the instability of diffusion-based generation.

What can AI do in other domains?

Generative AI has expanded well beyond text and images, showing strong capabilities across audio, music, video, and 3D content.

In the audio domain, models can synthesize natural speech, generate original music, and even mimic a speaker’s voice and prosody. Speech-to-text systems—also known as Automatic Speech Recognition (ASR)—can transcribe spoken language with near human-level accuracy.

In video, generative models can create photorealistic clips from text prompts and perform advanced editing tasks such as object removal and scene modification. In 3D, models can reconstruct scenes from images and generate complex objects directly from textual descriptions.

Table 1 summarizes representative models across audio, video, and 3D domains, highlighting their architectures and capabilities, from music generation (Jukebox) and speech recognition (Whisper, USM) to text-to-video (Phenaki, Imagen Video) and text-to-3D generation (DreamFusion).

Model | Organization | Year | Domain | Architecture | Performance
3D-GQN | DeepMind | 2018 | 3D | Deep, iterative, latent variable density models | 3D scene generation from 2D images
Jukebox | OpenAI | 2020 | Music | VQ-VAE + transformer | High-fidelity music generation in different styles
Whisper | OpenAI | 2022 | Sound/speech | Transformer | Near human-level speech recognition
Imagen Video | Google | 2022 | Video | Frozen text transformers + video diffusion models | High-definition video generation from text
Phenaki | Google & UCL | 2022 | Video | Bidirectional masked transformer | Realistic video generation from text
TecoGAN | U. Munich | 2022 | Video | Temporal coherence module | High-quality, smooth video generation
DreamFusion | Google | 2022 | 3D | NeRF + Diffusion | High-fidelity 3D object generation from text
AudioLM | Google | 2023 | Sound/speech | Tokenizer + transformer LM + detokenizer | High linguistic quality speech generation maintaining speaker’s identity
AudioGen | Meta AI | 2023 | Sound/speech | Transformer + text guidance | High-quality conditional and unconditional audio generation
Universal Speech Model (USM) | Google | 2023 | Sound/speech | Encoder-decoder transformer | State-of-the-art multilingual speech recognition
Table 1: Models for audio, video, and 3D domains

Across these applications, progress is driven by advances in deep generative architectures—including GANs, diffusion models, and transformers—with major contributions from research labs at Google, OpenAI, Meta, and DeepMind.

Conclusion

Enabled by increased computing power and advances in deep neural networks, modern generative models capture the complexity of real-world data far more effectively than earlier approaches. In this chapter, we reviewed the evolution of deep learning and generative AI, focusing on LLMs and GPT models, the Transformer architecture, and diffusion-based image generation. We also surveyed applications beyond text and images, including sound, video, and 3D content.

The next chapter turns to tooling, introducing the LangChain framework and showing how it can be used to build, extend, and operationalize LLM-based applications.