Large Language Models (LLMs) & Foundation Models¶
1. What is a Large Language Model (LLM)?¶
A Large Language Model is a type of AI model trained on massive amounts of text data to understand and generate human language.
“Large” refers to two things:
- Large data: trained on hundreds of billions of words (books, websites, code, articles)
- Large model: billions of internal parameters (weights) that encode knowledge
Examples: Google Gemini, GPT-4, Claude, LLaMA, Mistral.
What can LLMs do?¶
- Generate text (articles, emails, code, summaries)
- Answer questions
- Translate languages
- Classify or analyze text
- Hold conversations
- Reason through problems step by step
Key insight¶
LLMs don’t “know” things the way humans do. They are extremely good at predicting what text should come next given a context — and this surprisingly general capability leads to intelligent-seeming behavior.
2. How are LLMs Trained?¶
Training an LLM happens in multiple phases:
Phase 1 — Pre-training (Self-Supervised Learning)¶
The model is trained on a huge corpus of raw text (no human labels needed).
Task: Predict the next token (word/subword) given the previous ones.
Input: "The capital of France is ___"
Target: "Paris"
The model processes billions of such examples and adjusts its billions of parameters to get better at this task. Over time, it implicitly learns grammar, facts, reasoning patterns, and world knowledge.
- This phase requires enormous computing power (thousands of GPUs/TPUs, weeks of training)
- Google uses TPUs (Tensor Processing Units) — custom chips optimized for this workload
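To make the objective concrete, here is a toy sketch of next-token prediction over a four-word vocabulary. The vocabulary and scores are invented for illustration; real models score tens of thousands of tokens using billions of parameters.

```python
import numpy as np

# Toy next-token prediction: given "The capital of France is",
# score every token in a tiny, invented vocabulary.
vocab = ["Paris", "London", "Berlin", "pizza"]
logits = np.array([4.0, 2.0, 1.5, -1.0])  # made-up raw model scores

# Softmax turns raw scores into a probability for each candidate token.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy loss: negative log-probability of the correct next token.
# Pre-training adjusts the model's parameters to push this loss down.
target = vocab.index("Paris")
loss = -np.log(probs[target])

print(dict(zip(vocab, probs.round(3))))  # "Paris" gets most of the mass
print(f"loss = {loss:.3f}")
```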
Phase 2 — Fine-tuning (Supervised Learning)¶
After pre-training, the model is further trained on a smaller, curated dataset for a specific task or behavior.
Example: Fine-tuning on question-answer pairs so the model responds helpfully to user questions rather than just completing text.
On Google Cloud, this is done via Vertex AI supervised tuning.
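For a rough feel of what this looks like in practice, the sketch below writes a tiny tuning dataset as JSONL. The `contents` schema and the `sft` interface in the trailing comment are assumptions based on the Vertex AI SDK; verify both against the current documentation before relying on them.

```python
import json

# Hypothetical Q&A pairs for supervised tuning (invented examples).
examples = [
    {"question": "What is a context window?",
     "answer": "The maximum number of tokens a model can process at once."},
    {"question": "What is a token?",
     "answer": "The basic unit of text an LLM processes, roughly 3/4 of a word."},
]

# Write one JSON object per line (JSONL). The "contents" shape below is an
# assumption about the tuning dataset schema; confirm it in the current docs.
with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"contents": [
            {"role": "user", "parts": [{"text": ex["question"]}]},
            {"role": "model", "parts": [{"text": ex["answer"]}]},
        ]}
        f.write(json.dumps(record) + "\n")

# Launching the job (assumed SDK interface; upload the JSONL to GCS first):
# from vertexai.tuning import sft
# job = sft.train(source_model="gemini-1.5-pro-002",
#                 train_dataset="gs://my-bucket/train.jsonl")
```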
Phase 3 — RLHF (Reinforcement Learning from Human Feedback)¶
Human raters evaluate model outputs and score them. The model is trained to maximize high-quality, safe, and helpful responses.
Model generates answer → Human rates it → Model improves toward higher-rated outputs
This is what makes LLMs feel “aligned” — they’ve learned not just to be accurate, but to be helpful, harmless, and honest.
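A deliberately toy sketch of that loop (real RLHF trains a reward model and updates billions of parameters with policy-gradient methods; this just reweights two canned answers):

```python
import random

# Two canned candidate answers; their weights decide how often each is sampled.
# Real RLHF updates billions of parameters; this toy just shifts two numbers.
candidates = ["Here is a clear, step-by-step explanation...",
              "idk, figure it out yourself"]
weights = [1.0, 1.0]

def human_rating(answer: str) -> float:
    # Stand-in for a human rater: helpful answers score higher.
    return 1.0 if "step-by-step" in answer else -1.0

for _ in range(20):
    i = random.choices(range(len(candidates)), weights=weights)[0]
    reward = human_rating(candidates[i])
    # Reinforce: raise the weight of well-rated outputs, lower poorly rated ones.
    weights[i] = max(0.1, weights[i] + 0.2 * reward)

print(weights)  # the helpful answer's weight grows across iterations
```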
3. The Transformer Architecture (Simplified)¶
All modern LLMs are built on the Transformer architecture (introduced by Google in 2017 in the paper “Attention Is All You Need”).
The key innovation is the attention mechanism — the model learns which parts of the input to “pay attention to” when generating each output token.
Input tokens → Embeddings → Multiple Transformer Layers → Output tokens
                                     ↑
                      (Attention mechanism learns
                   relationships between all tokens)
You don’t need to understand the math for GAIL, but you should know:
- Transformers are the architecture behind LLMs
- They handle long-range dependencies in text (e.g., a pronoun referring to a noun several sentences back)
- The more layers and parameters, the more capable (and expensive) the model
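For the curious, a few lines of code make the attention idea concrete. A minimal NumPy sketch of scaled dot-product attention; the token embeddings are random stand-ins, and a real Transformer would derive Q, K, and V from learned projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V  # each output is a weighted mix of all value vectors

# 4 tokens, each an 8-dimensional embedding (random stand-ins, not real weights).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

out = attention(x, x, x)  # self-attention: Q, K, V all come from the same tokens
print(out.shape)  # (4, 8): one context-aware vector per input token
```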
4. What are Foundation Models?¶
A Foundation Model is a large model trained on broad, general data that can be adapted to a wide range of downstream tasks.
The term was coined by Stanford researchers in 2021. It describes models like Gemini, GPT-4, and Claude.
Key Characteristics of Foundation Models¶
| Characteristic | Description |
|---|---|
| Scale | Trained on massive datasets with billions of parameters |
| Generality | Not trained for one specific task — adaptable to many |
| Emergent capabilities | Abilities that weren’t explicitly trained (e.g., reasoning, code generation) |
| Transfer learning | Knowledge from pre-training transfers to new tasks with minimal additional training |
| Multimodal potential | Many can handle text, images, audio, and video |
The “Foundation” Metaphor¶
Think of a foundation model like a pre-built foundation for a house. Instead of laying every brick yourself (training from scratch), you build on top of this foundation — adapting it for your specific use case through fine-tuning or prompting.
5. Foundation Models vs Traditional ML Models¶
This is a common exam topic. The contrast is important.
| | Traditional ML Model | Foundation Model |
|---|---|---|
| Training data | Small, task-specific, labeled dataset | Massive, general, mostly unlabeled |
| Task scope | One specific task (e.g., classify emails) | General-purpose, adaptable to many tasks |
| Reusability | Hard to reuse for other tasks | Designed to be reused and adapted |
| Training cost | Low to moderate | Extremely high (millions of $$$) |
| Adaptation method | Re-train from scratch | Prompting or fine-tuning |
| Data requirements | Needs clean, labeled data | Can learn from raw, unlabeled data |
| Examples | Spam classifier, fraud detector | Gemini, GPT-4, Claude |
Practical implication for GAIL¶
Most organizations don’t train foundation models from scratch; they use them. Google provides foundation models via Vertex AI Model Garden and the Gemini APIs. Companies then adapt them with prompting, grounding, or fine-tuning.
6. Types of Generative AI Models¶
The GAIL exam requires you to distinguish between three main model families:
6.1 Large Language Models (LLMs)¶
- Input/Output: Text → Text (primarily)
- How they work: Transformer-based, trained to predict the next token
- Strengths: Language understanding, generation, reasoning, Q&A, summarization, code
- Examples: Gemini Pro, GPT-4, Claude, LLaMA
Use cases: Chatbots, document summarization, code assistants, translation, content generation.
6.2 Diffusion Models¶
- Input/Output: Noise → Image (or Audio/Video)
- How they work: They learn to progressively remove noise from a random signal to reconstruct a meaningful output (like an image)
Random noise → [Denoising steps] → Final image
The training process works in reverse: take a real image, gradually add noise until it’s pure static, then teach the model to reverse this process.
- Strengths: High-quality image, audio, and video generation
- Examples: Stable Diffusion, DALL-E, Google Imagen, Sora (video)
Use cases: Image generation, image editing, video synthesis, product design, art creation.
Key difference from LLMs: Diffusion models generate visual/audio media, not text. They operate in pixel/latent space, not token space.
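For intuition, here is a tiny sketch of the forward (noising) half of the process that a diffusion model learns to invert. The image and noise schedule are placeholders, not values from any real system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a real image: an 8x8 grayscale gradient.
image = np.linspace(0.0, 1.0, 64).reshape(8, 8)

# Illustrative noise schedule: the retained signal shrinks toward zero.
alpha_bar = np.linspace(0.99, 0.01, 10)

for t, a in enumerate(alpha_bar):
    noise = rng.normal(size=image.shape)
    # Forward diffusion step: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise.
    x_t = np.sqrt(a) * image + np.sqrt(1 - a) * noise
    print(f"step {t}: signal fraction ~ {np.sqrt(a):.2f}")

# Training teaches a network to predict `noise` from x_t; at generation time
# the model runs these steps in reverse, turning pure static into an image.
```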
6.3 Multimodal Models¶
- Input/Output: Multiple modalities (text + image + audio + video)
- How they work: Combine different model architectures (e.g., vision encoder + language model) so the model can reason across modalities
Input: [Image of a chart] + "Summarize this data"
Output: "The chart shows revenue grew 40% in Q3..."
- Strengths: Cross-modal understanding and generation
- Examples: Gemini 1.5/2.0 (Google’s flagship multimodal model), GPT-4o, Claude 3
Use cases:
- Describing or analyzing images
- Answering questions about documents with charts
- Generating images from text descriptions
- Video understanding and Q&A
- Audio transcription and analysis
Key insight for GAIL: Gemini is Google’s primary multimodal foundation model. Knowing that Gemini can handle text, images, audio, video, and code in a single model is a core exam point.
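For illustration, a minimal multimodal request with the Vertex AI Python SDK might look like the sketch below. The project, bucket path, and model ID are placeholders, and the exact API surface should be treated as an assumption to check against current docs:

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Placeholder project/region; replace with your own values.
vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")  # model ID may differ; check docs

# One request that mixes an image (in a GCS bucket) with a text instruction.
response = model.generate_content([
    Part.from_uri("gs://my-bucket/q3_revenue_chart.png", mime_type="image/png"),
    "Summarize this data",
])
print(response.text)
```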
7. Side-by-Side Comparison¶
| | LLM | Diffusion Model | Multimodal Model |
|---|---|---|---|
| Primary input | Text | Text prompt (for image gen) | Text + Image + Audio + Video |
| Primary output | Text | Image / Video / Audio | Any combination |
| Architecture | Transformer | U-Net / Diffusion process | Hybrid (Transformer + Vision encoder) |
| Google example | Gemini (text mode) | Imagen | Gemini 1.5/2.0 Pro |
| Typical use | Chat, summarization | Image generation | Document Q&A, visual reasoning |
8. Token: The Fundamental Unit of LLMs¶
Everything in an LLM revolves around tokens — the basic units of text the model processes.
- A token is roughly ¾ of a word on average
- “ChatGPT is great” ≈ 4–5 tokens
- LLMs have a context window — the maximum number of tokens they can process at once
| Model | Approximate Context Window |
|---|---|
| Gemini 1.0 Pro | 32,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens (~750,000 words) |
| Gemini 2.0 Flash | 1,000,000 tokens |
Why context window matters: A larger context window means the model can “remember” longer conversations, process full documents, and reason over more information at once.
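A quick back-of-the-envelope helper based on the ~¾-word rule of thumb above (real tokenizers vary by model, so treat this purely as an estimate):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~0.75 words-per-token heuristic."""
    return round(len(text.split()) / 0.75)

def fits_in_context(text: str, context_window: int = 32_000) -> bool:
    # Keep ~10% headroom for the model's own output tokens.
    return estimate_tokens(text) < context_window * 0.9

print(estimate_tokens("ChatGPT is great"))  # ~4, matching the example above
```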
9. Emergent Capabilities¶
One of the most surprising properties of large foundation models is emergence — capabilities that appear only at scale and were never explicitly trained.
Examples of emergent capabilities in LLMs:
- Multi-step reasoning — solving math problems step by step
- In-context learning — learning from examples given in the prompt (few-shot; see the sketch below)
- Code generation — writing functional code without being explicitly trained for it
- Language translation — without being a dedicated translation model
Key insight: Nobody programmed these capabilities directly. They emerged from scale. This is both what makes foundation models powerful and what makes them unpredictable.
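In-context learning is the easiest of these to see directly: the few-shot prompt below defines a sentiment-labeling task purely through examples in the context, with no training step (the reviews and labels are invented):

```python
# Few-shot prompt: the model infers the task from examples in its context
# window, with no parameter updates. The reviews and labels are invented.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day." -> Positive
Review: "It broke after one week." -> Negative
Review: "Setup was quick and painless." ->"""

# Sent to any LLM endpoint, the expected completion is " Positive".
print(prompt)
```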
10. Key Vocabulary Cheat Sheet¶
| Term | Definition |
|---|---|
| Token | Basic unit of text an LLM processes (~¾ of a word) |
| Context window | Max tokens a model can process in one request |
| Parameter | Internal value the model learns during training (billions in LLMs) |
| Embedding | A numerical vector representation of text/image meaning |
| Transformer | The neural network architecture powering modern LLMs |
| Attention mechanism | How a Transformer weighs the importance of each token relative to others |
| Pre-training | Initial training on massive unlabeled data |
| Fine-tuning | Further training on smaller, task-specific data |
| RLHF | Reinforcement Learning from Human Feedback — aligns models to human preferences |
| Foundation model | Large, general-purpose model adaptable to many tasks |
| Multimodal | A model that handles multiple input/output types (text, image, audio, video) |
| Diffusion model | Generates images/video by learning to reverse a noise process |
| Emergent capability | Ability that appears at scale but wasn’t explicitly trained |
| Latency | Time to receive the first response from the model |
| Throughput | Number of tokens generated per second |
11. Google’s LLM / Foundation Model Ecosystem (for GAIL)¶
| Model / Product | Type | Where to access |
|---|---|---|
| Gemini 2.0 Flash | Multimodal LLM (fast, efficient) | Vertex AI, AI Studio |
| Gemini 1.5 Pro | Multimodal LLM (long context) | Vertex AI, AI Studio |
| Imagen 3 | Diffusion model (image generation) | Vertex AI |
| Chirp | Audio / Speech model | Vertex AI |
| Codey | Code-specialized LLM | Vertex AI |
| Vertex AI Model Garden | Catalog of Google + third-party models | Vertex AI |