🎯 Learning Goals
- Define key terms in Generative AI, such as Large Language Models (LLMs), Transformers, Self-Attention, and Encoder-Decoder
- Explain how self-attention allows transformers to process entire sentences simultaneously, leading to more context-aware language generation
- Understand strategies like fine-tuning and Retrieval Augmented Generation (RAG) to improve response quality and relevance
📚 Technical Vocabulary
- Generative AI
- Large Language Models (LLMs)
- Transformers
- Self-Attention
- Encoder-Decoder
Warm-Up
You might remember from our prompt engineering lesson that Large Language Models (LLMs), like the ones behind chatbots and AI assistants, predict the next word in a sentence based on patterns they've learned from huge amounts of text. In this activity, you'll play the role of the LLM!
You'll see a sentence with one word missing at the end. Your task is to guess the most likely next word based on what makes sense in context. Think about common phrases, sentence structure, and storytelling patterns, just like an AI would!
- The sun rises in the ___
- In the middle of the night, I heard a ___
- If you work hard, you will ___
After you make your guess, we'll discuss the best predictions and why they make sense. Let's see how good you are at predicting words. Maybe you'll think like an LLM!
Generative AI
Generative AI refers to artificial intelligence systems designed to create new content, such as text, images, audio, code, or video, that wasn't explicitly programmed. Unlike systems that simply classify or analyze existing information, generative AI produces original outputs based on patterns it learned during training. These systems can write essays, generate artwork, compose music, write functional code, and even hold conversations that feel remarkably human-like.
Just like in the warm-up exercise where you predicted the next word based on your understanding of common phrases and natural language patterns, Large Language Models operate on a similar principle, but at an immense scale: they predict the most likely next word based on billions of examples they've seen during training. The key difference is that while you drew on your lifetime of language experience to make one prediction, these models can rapidly generate thousands of predictions to create entire paragraphs or documents.
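To make the prediction idea concrete, here is a toy next-word predictor. The tiny "corpus" and the helper name `predict_next` are invented for illustration; real LLMs learn neural networks from billions of examples rather than counting raw word sequences:

```python
from collections import Counter

# A tiny invented "training corpus"; real LLMs see billions of examples.
corpus = [
    "the sun rises in the east",
    "the sun rises in the east every day",
    "the sun rises in the morning",
]

def predict_next(prefix):
    """Return the word most often seen after `prefix` in the corpus."""
    counts = Counter()
    prefix_words = prefix.split()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - len(prefix_words)):
            if words[i:i + len(prefix_words)] == prefix_words:
                counts[words[i + len(prefix_words)]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the sun rises in the"))  # prints "east"
```

Because "east" follows that prefix more often than "morning" in this toy corpus, the predictor picks it, just as you picked the most familiar completion in the warm-up.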
This builds on the neural network concepts we've explored previously. The same fundamental principles of weights, activations, and learning from data apply, but with specialized architectures designed for generation rather than classification.
Many of the breakthroughs in generative AI, such as Large Language Models (LLMs), combine elements of both supervised and unsupervised learning during different stages of training. In the pre-training stage, the model learns from large datasets without explicit labels. This is unsupervised learning! The model self-learns from massive amounts of text, building a strong foundation in language understanding. Then the pre-trained model is fine-tuned using labeled datasets with clear inputs and expected outputs. This is a form of supervised learning! Finally, some LLMs include reinforcement learning, where humans rank the model's responses, further refining the model to make it more helpful, safe, and user-friendly. In reality, these types of machine learning don't have to be isolated from one another. They can all be used together to create something awesome!
Transformers: A Transformational Technology
LLMs have become a cornerstone of modern natural language processing, with the transformer architecture driving their success. Early generative systems relied on recurrent neural networks (RNNs) and their variants, which processed language one word at a time. While these models could generate sentences, they often lost track of the broader context, making it challenging to produce coherent and contextually rich outputs. Consider this example of translating a sentence word-for-word:
"I'm looking forward to the party."
If you translate this sentence word-for-word to Spanish, you might end up with this sentence:
"Estoy mirando adelante a la fiesta."
Each word was translated correctly, but it doesn't make much sense in Spanish, because "looking forward to" is a colloquial phrase in English. This is a better translation in Spanish:
"Espero con ansias la fiesta."
If we translate that back into English word-for-word, it's something like this:
"I wait with craving the party."
As you can see, processing one word at a time leads to less than stellar results. Then came the revolutionary idea of the transformer, a specialized type of neural network architecture. Instead of processing words sequentially, transformers introduced a novel concept known as self-attention. This mechanism allowed the model to weigh the importance of every word in a sentence simultaneously, capturing relationships between words regardless of their position. In fact, the title of the paper introducing the Transformer architecture was "Attention Is All You Need"! This new architecture allowed the model to "look around" the entire sentence at once, understanding not just the local neighborhood of a word but the entire meaning and context of the word. As a result, language models became substantially more powerful!
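Here is a heavily simplified sketch of self-attention. Real transformers also use learned query/key/value projections and multiple attention heads, which are omitted here; the point is only that every word's representation becomes a weighted blend of all the words in the sentence at once:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Compare every word's vector with every other word's vector
    (dot product), convert the scores into weights, and blend."""
    outputs = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        blended = [sum(w * k[d] for w, k in zip(weights, vectors))
                   for d in range(len(q))]
        outputs.append(blended)
    return outputs

# Three toy 2-dimensional "word embeddings" (values invented for illustration)
words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attended = self_attention(words)
```

Notice that the loop looks at the whole list of vectors for every word; nothing is processed "one word at a time" the way an RNN would.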
This breakthrough changed the game in generative AI. With transformers, models could now generate text that was not only more coherent but also more contextually aware. This led to the development of large language models (LLMs) like GPT and BERT, which transformed how machines understand and produce language. These models, trained on vast amounts of unlabeled text using unsupervised learning, learned the subtle nuances of human language and can now generate creative, nuanced responses that feel almost human. Now we know where the term GPT in ChatGPT comes from: Generative Pre-trained Transformer!
Transformer Architecture
A transformer is primarily composed of two components: an encoder and a decoder.
- The encoder receives the input and creates a context-aware representation of its features. The encoder model is optimized to gather an understanding of the input.
- The decoder uses the encoder's representation along with other inputs to generate a target sequence. The decoder model is optimized for generating outputs.
Most generative tasks use encoder-decoder models, also known as sequence-to-sequence models. Large Language Models using a transformer architecture can tackle many different kinds of NLP tasks like translation, summarization, and text generation. Let's take a look at a simple example.
Imagine we have the beginning of a story: "Once upon a time..." and our goal is to predict what comes next. Here's how the encoder and decoder work together in this scenario:
- The encoder reads the input sequence ("Once upon a time...") and transforms it into a series of vectors. These vectors capture the meaning and relationships between the words. Essentially, the encoder builds a contextual map of the input, understanding that this phrase sets the stage for a narrative.
- The decoder takes the encoder's contextual map and the words it has already generated to make a prediction. For example, after processing the input, the decoder might decide that the next word should be "there" based on the story's context. The output of a decoder is a probability distribution over the words that might follow the current sequence. For this example, that might look something like this:
- "there" → 65% ("Once upon a time, there was…")
- "lived" → 15% ("Once upon a time lived a…")
- "in" → 10% ("Once upon a time in a faraway land…")
- "a" → 5% ("Once upon a time a king ruled…")
In short, while the encoder focuses on understanding the given phrase, the decoder uses that understanding to continue the story, one word at a time.
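The probability table above translates directly into code. This sketch uses the numbers from the example (the remaining 5% of probability would be spread across all other words) and shows greedy decoding, one simple way a decoder's distribution can be turned into the next word; real decoders score the entire vocabulary:

```python
# The decoder's output for "Once upon a time" from the example above,
# as a probability distribution over candidate next words.
next_word_probs = {"there": 0.65, "lived": 0.15, "in": 0.10, "a": 0.05}

def greedy_decode(probs):
    """Greedy decoding: always pick the most probable next word."""
    return max(probs, key=probs.get)

print(greedy_decode(next_word_probs))  # prints "there"
```

Repeating this step, with each chosen word fed back into the decoder, is how the story continues one word at a time.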
Improving Model Performance
While it's possible to pre-train a large language model from scratch, it would require significant time and computing resources. Remember that Large Language Models are typically trained on a very large corpus of data, which can take up to several weeks to complete and can have a huge environmental impact! Check out this video of Sasha Luccioni from the Hugging Face team talking about the carbon footprint of Transformers.

Based on this information, it's easy to see why sharing pre-trained models is so important. Instead of starting over, we can build on existing models, reducing costs and cutting down on AI's carbon footprint.
Want to measure the environmental impact of training a model? Check out tools like ML CO2 Impact and Code Carbon. For a deeper dive, read this blog post on tracking emissions or explore the 🤗 Transformers documentation.
Thankfully, there are several low-cost ways to improve the performance of a model. You heard Sasha mention one of them in the video: fine-tuning!
Fine-Tuning
Fine-tuning a model means performing additional training with a dataset specific to your task after the model has been pre-trained. For NLP tasks, a pre-trained model will already have some statistical understanding of the language you are using for your task, so this is a good place to start! Fine-tuning requires far less data to get decent results, since the pre-trained model was already trained on heaps of data.
Essentially, fine-tuning a model involves reconfiguring the last layers of the neural network, adjusting specific parameters based on a dataset related to your use case. Although fine-tuning requires significantly less compute than pre-training (training a model from scratch), it's not inconsequential. Fine-tuning often involves large datasets and many parameters to tune, which can be challenging with limited compute resources. For this lesson, we'll stick with some other strategies for improving our chatbots.
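As a rough mental model of "adjusting only the last layers," here is a toy sketch. The layer names and the `trainable` flag are invented stand-ins, not a real ML framework; frameworks like PyTorch or TensorFlow express the same idea with their own freezing mechanisms:

```python
# Toy illustration: a "model" is a list of layers, and fine-tuning
# freezes the early layers so only the last ones are updated.
layers = [
    {"name": "embedding", "trainable": True},
    {"name": "transformer_block_1", "trainable": True},
    {"name": "transformer_block_2", "trainable": True},
    {"name": "output_head", "trainable": True},
]

def freeze_all_but_last(model, keep=1):
    """Mark every layer except the last `keep` layers as frozen."""
    for layer in model[:-keep]:
        layer["trainable"] = False
    return model

fine_tune_ready = freeze_all_but_last(layers, keep=1)
trainable = [l["name"] for l in fine_tune_ready if l["trainable"]]
print(trainable)  # prints ['output_head']
```

Because only the unfrozen layers are updated during fine-tuning, far fewer parameters change, which is why it needs less data and compute than pre-training.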
Prompt Engineering
Remember when we talked about crafting and refining your prompt to improve the response? You can improve your chatbot using similar strategies by modifying the system message that you provide to the model! The system message sets the overall behavior, tone, and priorities of the model. Here are a few ways to change the system message to improve performance:
- Give the model a "job" or "role": "You are a knowledgeable and patient AI tutor specializing in high school computer science."
- Limit the number of words: "Keep responses under 100 words unless explicitly asked for more detail."
- Provide a single example with one-shot prompting: "Provide clear explanations with examples. Follow this format: User: 'What is a function?' AI: 'A function in Python is a reusable block of code that performs a specific task. Here's an example: `def greet(name): return f'Hello, {name}!'` and `print(greet('Alice'))`. This function takes a name as input and returns a greeting.'"
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation or RAG leverages a pre-trained model alongside a retrieval system to fetch relevant documents on the fly. This means you don't need to fine-tune an entire large language model, which can be computationally expensive and may require GPUs or cloud resources. In this model, you provide the LLM with a knowledge base, a collection of documents and resources from which the AI model retrieves relevant data to improve its responses.
The main purpose of the knowledge base is to provide up-to-date or specialized information that likely would not have been included in the pre-trained model's parameters or training data. By retrieving relevant documents, the model can generate answers that are better supported by factual evidence, which can reduce hallucinations. The knowledge base allows the model to incorporate detailed context from specific domains, making the generated responses more accurate and tailored.
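Here is a minimal sketch of the retrieval step in RAG. The documents and the word-overlap scoring are invented for illustration; production systems, including the final project's embedding model, typically score relevance with vector embeddings rather than shared words:

```python
# Toy knowledge base (contents invented for illustration).
knowledge_base = [
    "The school science fair takes place on May 12 in the main gym.",
    "Library hours are 8am to 5pm on weekdays.",
]

def retrieve(question, docs):
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When is the science fair?"
context = retrieve(question, knowledge_base)

# The retrieved document is prepended to the prompt so the model can
# ground its answer in it.
augmented_prompt = f"Context: {context}\n\nQuestion: {question}"
```

The LLM then answers from the supplied context instead of relying only on what it memorized during training, which is how RAG reduces hallucinations.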
Think About It
RAG is especially useful in situations where accessing up-to-date, specialized, or extensive external information is critical. What are some real-world examples where RAG might be helpful?
In the final project, you'll see an example of RAG at work! You'll include relevant information in the knowledge base and use an embedding model to pull in relevant information for your project's topic.
Improving Your Chatbot
Next up, we'll continue improving your chatbot by incorporating an LLM and applying strategies to improve the generated responses. We'll explore the transformer architecture that underlies many of the LLMs you interact with, and we'll take practical steps to improve your chatbot. Let's dig in!
A Quick Note About APIs
An API (Application Programming Interface) is a set of rules that allows different software programs to communicate with each other. Think of it like a messenger that delivers your request to a system and returns a response. We'll be using a Hugging Face API to make requests to a model, and the API will send us the model's response. Since these requests happen over the internet, you'll want to add something called a token to your Hugging Face Space. It's like a digital key that grants access to a service.
Create a Hugging Face Token
To create a token, follow the steps below:
- Click on your avatar in the top right corner of Hugging Face and select "Access Tokens" from the menu.
- Click the "Create new token" button.
- Choose "Write" for the token type and give your token a name: `HF_TOKEN`.
- Click the "Create token" button.
- Copy your token and save it somewhere safe! It will disappear after you close this window! Once you've saved the token, click "Done".
Using a token ensures API calls are authenticated and secure. 🔒
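On a Space, a secret you add in Settings becomes an environment variable, so your code can read the token without ever hard-coding it. The `InferenceClient` handles authentication for you, but this hypothetical helper shows the underlying pattern (the `auth_headers` name and error message are invented; the `Bearer` header format is how Hugging Face API tokens are sent):

```python
import os

def auth_headers():
    """Build request headers from the HF_TOKEN environment variable.
    On Hugging Face Spaces, secrets appear as environment variables,
    so the token never sits in your source code."""
    token = os.environ.get("HF_TOKEN")
    if token is None:
        raise RuntimeError("HF_TOKEN is not set; add it as a Space secret.")
    return {"Authorization": f"Bearer {token}"}
```

Keeping the token out of `app.py` matters because Space code is often public, while secrets stay private.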
Code-Along | Connect Your Chatbot to an LLM
Continue working with the same chatbot you built in the previous lesson. Follow these steps to update your chatbot to fetch responses from an LLM!
- To add your token to your Hugging Face Space, open the Settings by clicking on the three dots. Scroll down to "Variables and secrets" and click "New secret". Use "HF_TOKEN" for the name of your secret and paste your token in the value field.
- Next, we'll use the InferenceClient class from Hugging Face to connect to and use a model hosted on Hugging Face. Add this line at the top of `app.py` next to the other import statements:

```python
from huggingface_hub import InferenceClient
```
- Create an instance of `InferenceClient` connected to the "microsoft/phi-4" text-generation model. This client will handle making requests to the model to generate responses:

```python
client = InferenceClient("microsoft/phi-4")
```
- Within the `respond()` function, initialize a list of dictionaries to store the messages. Each dictionary includes `role` and `content` keys:

```python
messages = [{"role": "system", "content": "You are a friendly chatbot."}]
```
- Add a conditional statement to add all previous conversation messages to the messages list only if there is conversation history provided:

```python
if history:
    messages.extend(history)
```
- Then add the current user's message to the messages list:

```python
messages.append({"role": "user", "content": message})
```
- Now we can make the chat completion API call, sending the messages and other parameters to the model:

```python
response = client.chat_completion(
    messages,
    max_tokens=100
)
```
- Finally, we'll extract and return the chatbot's response:

```python
return response['choices'][0]['message']['content'].strip()
```
Your completed `app.py` file should look like this:

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("microsoft/phi-4")

def respond(message, history):
    messages = [{"role": "system", "content": "You are a friendly chatbot."}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": message})
    response = client.chat_completion(
        messages,
        max_tokens=100
    )
    return response['choices'][0]['message']['content'].strip()

chatbot = gr.ChatInterface(respond)
chatbot.launch()
```
Now that we've connected our chatbot to an LLM using the InferenceClient, let's explore how we can fine-tune its responses. Remember how transformers process context to generate probabilities for the next words? We can influence this generation process by adjusting parameters like `max_tokens` and `temperature`. These parameters directly impact how the model applies its attention mechanism and makes token predictions. Let's experiment with these settings to see how they affect our chatbot's responses.
Try-It | More Parameters
We passed in two parameters to `chat_completion()`, but there are more options! Try increasing or decreasing the value for the `max_tokens` parameter. Next, try passing in a decimal value between 0 and 2 for `temperature` or a value between 0 and 1 for `top_p`. These are hyperparameters that control how straightforward and predictable the response is. Read more about them here!
🌶️ Mild Challenge: You can also change which model your API call is using! We used "microsoft/phi-4", but there are other compatible models shared on the 🤗 Hub! Try one of the following options:
- google/gemma-2-2b-it (You'll need to request permission first, because this is a gated model!)
Note: You may run into errors when you're trying different models. To see the error messages, set `debug` to `True` in `launch()`. Read through them (often the last message is the most helpful) and see if you can correct the issue. You'll also notice that DeepSeek's model returns all of its reasoning before the response. Interesting!
Adjusting parameters gives us control over the quality and style of responses! Now let's look at another way to improve the chatbot. Users often prefer to see responses appear gradually rather than waiting for the complete answer. This incremental generation mimics how transformer models naturally work: predicting one token at a time using self-attention across the existing context. By implementing streaming, we can expose this step-by-step generation process to users, creating a more engaging experience that reveals how the model builds context as it generates text.
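To build intuition for what `temperature` does before you experiment, here is a sketch that rescales the story example's next-word distribution. The `apply_temperature` helper is a simplification: inside a real model, temperature divides the raw logits before the softmax, but the effect on the probabilities is the same:

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a next-word distribution by a temperature value.
    Low temperature sharpens the distribution (more predictable);
    high temperature flattens it (more varied)."""
    logits = {w: math.log(p) for w, p in probs.items()}
    scaled = {w: math.exp(l / temperature) for w, l in logits.items()}
    total = sum(scaled.values())
    return {w: v / total for w, v in scaled.items()}

probs = {"there": 0.65, "lived": 0.15, "in": 0.10, "a": 0.05}
cool = apply_temperature(probs, 0.5)   # "there" dominates even more
warm = apply_temperature(probs, 1.5)   # probabilities move closer together
```

At temperature 0.5, "there" climbs above 90% probability; at 1.5, its lead shrinks, which is why higher temperatures make responses feel more varied and surprising.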
Try-It | Stream the Response
Use a `for` loop and replace `return` with `yield` to stream the response.
- Initialize the `response` variable as an empty string.
- Pass `stream=True` to the `client.chat_completion` method.
- Write a `for` loop to iterate through each `message` returned by the `client.chat_completion` method.
- Within the code block of the `for` loop:
  - Capture the most recent token: `token = message.choices[0].delta.content`
  - Add it to the response: `response += token`
  - And `yield` the response: `yield response`
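The steps above follow Python's standard generator pattern. This toy version swaps the real API for a fake stream so you can see the shape of the code without a network call (`fake_stream`, the sample sentence, and the toy ignoring the user's message are invented stand-ins):

```python
def fake_stream(text):
    """Stand-in for client.chat_completion(..., stream=True):
    yields one token (here, one word) at a time."""
    for word in text.split():
        yield word + " "

def respond_streaming(message):
    """Accumulate tokens and yield the growing response after each one,
    mirroring the steps above. (This toy ignores the user's message.)"""
    response = ""
    for token in fake_stream("Once upon a time there was a dragon"):
        response += token
        yield response

chunks = list(respond_streaming("Tell me a story"))
# Each chunk is longer than the last; the final one is the full sentence.
```

Gradio re-renders the chat bubble on every `yield`, so the user watches the response grow token by token instead of waiting for the whole thing.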
Hint: You will no longer `return` the response!
Now let's look at another way to improve the quality of those responses. Remember how we discussed that transformers rely heavily on context? The system message provides crucial initial context that shapes how the model applies its self-attention mechanisms. Just as the encoder component of a transformer creates a context-aware representation of inputs, our system message creates a foundation for the model's understanding of its role. By crafting a more specific system message, we can guide the model's responses without the computational expense of fine-tuning the entire network.
Practice | Improve the System Message
It's time to give your chatbot a specific role and personality! Choose a specialty for your chatbot, then craft a system message.
- Choose a role for your chatbot (examples: fitness coach, study buddy, travel advisor, recipe assistant, book recommender, career counselor).
- Write a system message that defines your chatbot's role using at least one strategy:
- Give the model a "job" or "role"
- Limit the number of words
- Provide a one-shot example format
- Test your chatbot with 3-5 questions and refine your system message based on the responses.
Example System Message: "You are a patient homework helper for high school students. Break down complex concepts into simple steps using everyday analogies. Keep responses under 100 words. Guide students to solutions. Never give direct answers."
💼 Takeaways
In this lesson, we explored how generative AI works, focusing on the transformer model, which helps AI understand and predict text more effectively.
- The transformer architecture overcomes limitations of older sequential models by processing words in context, resulting in more coherent and contextually accurate outputs.
- Using APIs (e.g., Hugging Face's `InferenceClient`) allows seamless integration of LLMs into applications like chatbots, making advanced NLP accessible and practical.
- Adjusting system messages and fine-tuning pre-trained models are strategies for tailoring chatbot behavior and improving response quality without extensive re-training.
For a summary of this lesson, check out the 11. Generative AI One-Pager!