🎯 Learning Goals
- Define key terms in Generative AI, such as Large Language Models (LLMs), Transformers, Self-Attention, and Encoder-Decoder
- Explain how self-attention allows transformers to process entire sentences simultaneously, leading to more context-aware language generation
- Understand strategies like fine-tuning and Retrieval Augmented Generation (RAG) to improve response quality and relevance
📚 Technical Vocabulary
- Generative AI
- Large Language Models (LLMs)
- Transformers
- Self-Attention
- Encoder-Decoder
Warm-Up
You might remember from our prompt engineering lesson that Large Language Models (LLMs), like the ones behind chatbots and AI assistants, predict the next word in a sentence based on patterns they've learned from huge amounts of text. In this activity, you'll play the role of the LLM!
You'll see a sentence with one word missing at the end. Your task is to guess the most likely next word based on what makes sense in context. Think about common phrases, sentence structure, and storytelling patterns, just like an AI would!
- The sun rises in the ___
- In the middle of the night, I heard a ___
- If you work hard, you will ___
After you make your guess, we'll discuss the best predictions and why they make sense. Let's see how good you are at predicting words. Maybe you'll think like an LLM!
Generative AI
Generative AI refers to artificial intelligence systems designed to create new content, such as text, images, audio, code, or video, that wasn't explicitly programmed. Unlike systems that simply classify or analyze existing information, generative AI produces original outputs based on patterns it learned during training. These systems can write essays, generate artwork, compose music, write functional code, and even hold conversations that feel remarkably human-like.
Just like in the warm-up exercise where you predicted the next word based on your understanding of common phrases and natural language patterns, Large Language Models operate on a similar principle, but at an immense scale: they predict the most likely next word based on billions of examples they've seen during training. The key difference is that while you drew on your lifetime of language experience to make one prediction, these models can rapidly generate thousands of predictions to create entire paragraphs or documents.
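To make the prediction idea concrete, here is a toy next-word predictor. The tiny "corpus" and the helper name `predict_next` are invented for illustration; real LLMs learn neural networks from billions of examples rather than counting raw word sequences:

```python
from collections import Counter

# A tiny invented "training corpus"; real LLMs see billions of examples.
corpus = [
    "the sun rises in the east",
    "the sun rises in the east every day",
    "the sun rises in the morning",
]

def predict_next(prefix):
    """Return the word most often seen after `prefix` in the corpus."""
    counts = Counter()
    prefix_words = prefix.split()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - len(prefix_words)):
            if words[i:i + len(prefix_words)] == prefix_words:
                counts[words[i + len(prefix_words)]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the sun rises in the"))  # prints "east"
```

Because "east" follows that prefix more often than "morning" in this toy corpus, the predictor picks it, just as you picked the most familiar completion in the warm-up.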
This builds on the neural network concepts we've explored previously. The same fundamental principles of weights, activations, and learning from data apply, but with specialized architectures designed for generation rather than classification.
Many of the breakthroughs in generative AI, such as Large Language Models (LLMs), combine elements of both supervised and unsupervised learning during different stages of training. In the pre-training stage, the model learns from large datasets without explicit labels. This is unsupervised learning! The model self-learns from massive amounts of text, building a strong foundation in language understanding. Then the pre-trained model is fine-tuned using labeled datasets with clear inputs and expected outputs. This is a form of supervised learning! Finally, some LLMs include reinforcement learning, where humans rank the model's responses, further refining the model to make it more helpful, safe, and user-friendly. In reality, these types of machine learning don't have to be isolated from one another. They can all be used together to create something awesome!
Transformers: A Transformational Technology
LLMs have become a cornerstone of modern natural language processing, with the transformer architecture driving their success. Early generative systems relied on recurrent neural networks (RNNs) and their variants, which processed language one word at a time. While these models could generate sentences, they often lost track of the broader context, making it challenging to produce coherent and contextually rich outputs. Consider this example of translating a sentence word-for-word:
"I'm looking forward to the party."
If you translate this sentence word-for-word to Spanish, you might end up with this sentence:
"Estoy mirando adelante a la fiesta."
Each word was translated correctly, but it doesn't make much sense in Spanish, because "looking forward to" is a colloquial phrase in English. This is a better translation in Spanish:
"Espero con ansias la fiesta."
If we translate that back into English word-for-word, it's something like this:
"I wait with craving the party."
As you can see, processing one word at a time leads to less than stellar results. Then came the revolutionary idea of the transformer, a specialized type of neural network architecture. Instead of processing words sequentially, transformers introduced a novel concept known as self-attention. This mechanism allowed the model to weigh the importance of every word in a sentence simultaneously, capturing relationships between words regardless of their position. In fact, the title of the paper introducing the Transformer architecture was "Attention Is All You Need"! This new architecture allowed the model to "look around" the entire sentence at once, understanding not just the local neighborhood of a word but the entire meaning and context of the word. As a result, language models became substantially more powerful!
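Here is a heavily simplified sketch of self-attention. Real transformers also use learned query/key/value projections and multiple attention heads, which are omitted here; the point is only that every word's representation becomes a weighted blend of all the words in the sentence at once:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Compare every word's vector with every other word's vector
    (dot product), convert the scores into weights, and blend."""
    outputs = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        blended = [sum(w * k[d] for w, k in zip(weights, vectors))
                   for d in range(len(q))]
        outputs.append(blended)
    return outputs

# Three toy 2-dimensional "word embeddings" (values invented for illustration)
words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attended = self_attention(words)
```

Notice that the loop looks at the whole list of vectors for every word; nothing is processed "one word at a time" the way an RNN would.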
This breakthrough changed the game in generative AI. With transformers, models could now generate text that was not only more coherent but also more contextually aware. This led to the development of large language models (LLMs) like GPT and BERT, which transformed how machines understand and produce language. These models, trained on vast amounts of unlabeled text using unsupervised learning, learned the subtle nuances of human language and can now generate creative, nuanced responses that feel almost human. Now we know where the term GPT in ChatGPT comes from: Generative Pre-trained Transformer!
Transformer Architecture
A transformer is primarily composed of two components: an encoder and a decoder.
- The encoder receives the input and creates a context-aware representation of its features. The encoder model is optimized to gather an understanding of the input.
- The decoder uses the encoder's representation along with other inputs to generate a target sequence. The decoder model is optimized for generating outputs.
Most generative tasks use encoder-decoder models, also known as sequence-to-sequence models. Large Language Models using a transformer architecture can tackle many different kinds of NLP tasks like translation, summarization, and text generation. Let's take a look at a simple example.
Imagine we have the beginning of a story: "Once upon a time..." and our goal is to predict what comes next. Here's how the encoder and decoder work together in this scenario:
- The encoder reads the input sequence ("Once upon a time...") and transforms it into a series of vectors. These vectors capture the meaning and relationships between the words. Essentially, the encoder builds a contextual map of the input, understanding that this phrase sets the stage for a narrative.
- The decoder takes the encoder's contextual map and the words it has already generated to make a prediction. For example, after processing the input, the decoder might decide that the next word should be "there" based on the story's context. The output of a decoder is a probability distribution over the words that might follow the current sequence. For this example, that might look something like this:
- "there" → 65% ("Once upon a time, there was…")
- "lived" → 15% ("Once upon a time lived a…")
- "in" → 10% ("Once upon a time in a faraway land…")
- "a" → 5% ("Once upon a time a king ruled…")
In short, while the encoder focuses on understanding the given phrase, the decoder uses that understanding to continue the story, one word at a time.
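The probability table above translates directly into code. This sketch uses the numbers from the example (the remaining 5% of probability would be spread across all other words) and shows greedy decoding, one simple way a decoder's distribution can be turned into the next word; real decoders score the entire vocabulary:

```python
# The decoder's output for "Once upon a time" from the example above,
# as a probability distribution over candidate next words.
next_word_probs = {"there": 0.65, "lived": 0.15, "in": 0.10, "a": 0.05}

def greedy_decode(probs):
    """Greedy decoding: always pick the most probable next word."""
    return max(probs, key=probs.get)

print(greedy_decode(next_word_probs))  # prints "there"
```

Repeating this step, with each chosen word fed back into the decoder, is how the story continues one word at a time.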
Improving Model Performance
While it's possible to pre-train a large language model from scratch, it would require significant time and computing resources. Remember that Large Language Models are typically trained on a very large corpus of data, which can take up to several weeks to complete and can have a huge environmental impact! Check out this video of Sasha Luccioni from the Hugging Face team talking about the carbon footprint of Transformers.

Based on this information, it's easy to see why sharing pre-trained models is so important. Instead of starting over, we can build on existing models, reducing costs and cutting down on AI's carbon footprint.
Want to measure the environmental impact of training a model? Check out tools like ML CO2 Impact and Code Carbon. For a deeper dive, read this blog post on tracking emissions or explore the 🤗 Transformers documentation.
Thankfully, there are several low-cost ways to improve the performance of a model. You heard Sasha mention one of them in the video: fine-tuning!
Fine-Tuning
Fine-tuning a model means performing additional training with a dataset specific to your task after the model has been pre-trained. For NLP tasks, a pre-trained model will already have some statistical understanding of the language you are using for your task, so this is a good place to start! Fine-tuning requires far less data to get decent results, since the pre-trained model was already trained on heaps of data.
Essentially, fine-tuning a model involves reconfiguring the last layers of the neural network, adjusting specific parameters based on a dataset related to your use case. Although fine-tuning requires significantly less compute than pre-training (training a model from scratch), it's not inconsequential. Fine-tuning often involves large datasets and many parameters to tune, which can be challenging with limited compute resources. For this lesson, we'll stick with some other strategies for improving our chatbots.
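As a rough mental model of "adjusting only the last layers," here is a toy sketch. The layer names and the `trainable` flag are invented stand-ins, not a real ML framework; frameworks like PyTorch or TensorFlow express the same idea with their own freezing mechanisms:

```python
# Toy illustration: a "model" is a list of layers, and fine-tuning
# freezes the early layers so only the last ones are updated.
layers = [
    {"name": "embedding", "trainable": True},
    {"name": "transformer_block_1", "trainable": True},
    {"name": "transformer_block_2", "trainable": True},
    {"name": "output_head", "trainable": True},
]

def freeze_all_but_last(model, keep=1):
    """Mark every layer except the last `keep` layers as frozen."""
    for layer in model[:-keep]:
        layer["trainable"] = False
    return model

fine_tune_ready = freeze_all_but_last(layers, keep=1)
trainable = [l["name"] for l in fine_tune_ready if l["trainable"]]
print(trainable)  # prints ['output_head']
```

Because only the unfrozen layers are updated during fine-tuning, far fewer parameters change, which is why it needs less data and compute than pre-training.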
Prompt Engineering
Remember when we talked about crafting and refining your prompt to improve the response? You can improve your chatbot using similar strategies by modifying the system message that you provide to the model! The system message sets the overall behavior, tone, and priorities of the model. Here are a few ways to change the system message to improve performance:
- Give the model a "job" or "role": "You are a knowledgeable and patient AI tutor specializing in high school computer science."
- Limit the number of words: "Keep responses under 100 words unless explicitly asked for more detail."
- Provide a single example with one-shot prompting: "Provide clear explanations with examples. Follow this format: User: 'What is a function?' AI: 'A function in Python is a reusable block of code that performs a specific task. Here's an example: `def greet(name): return f'Hello, {name}!'` and `print(greet('Alice'))`. This function takes a name as input and returns a greeting.'"
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation or RAG leverages a pre-trained model alongside a retrieval system to fetch relevant documents on the fly. This means you don't need to fine-tune an entire large language model, which can be computationally expensive and may require GPUs or cloud resources. In this model, you provide the LLM with a knowledge base, a collection of documents and resources from which the AI model retrieves relevant data to improve its responses.
The main purpose of the knowledge base is to provide up-to-date or specialized information that likely would not have been included in the pre-trained model's parameters or training data. By retrieving relevant documents, the model can generate answers that are better supported by factual evidence, which can reduce hallucinations. The knowledge base allows the model to incorporate detailed context from specific domains, making the generated responses more accurate and tailored.
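Here is a minimal sketch of the retrieval step in RAG. The documents and the word-overlap scoring are invented for illustration; production systems, including the final project's embedding model, typically score relevance with vector embeddings rather than shared words:

```python
# Toy knowledge base (contents invented for illustration).
knowledge_base = [
    "The school science fair takes place on May 12 in the main gym.",
    "Library hours are 8am to 5pm on weekdays.",
]

def retrieve(question, docs):
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When is the science fair?"
context = retrieve(question, knowledge_base)

# The retrieved document is prepended to the prompt so the model can
# ground its answer in it.
augmented_prompt = f"Context: {context}\n\nQuestion: {question}"
```

The LLM then answers from the supplied context instead of relying only on what it memorized during training, which is how RAG reduces hallucinations.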
Think About It
RAG is especially useful in situations where accessing up-to-date, specialized, or extensive external information is critical. What are some real-world examples where RAG might be helpful?
In the final project, you'll see an example of RAG at work! You'll include relevant information in the knowledge base and use an embedding model to pull in relevant information for your project's topic.
Improving Your Chatbot
Next up, we'll continue improving your chatbot by incorporating an LLM and applying strategies to improve the generated responses. We'll explore the transformer architecture that underlies many of the LLMs you interact with, and we'll take practical steps to improve your chatbot. Let's dig in!
A Quick Note About APIs
An API (Application Programming Interface) is a set of rules that allows different software programs to communicate with each other. Think of it like a messenger that delivers your request to a system and returns a response. We'll be using a Hugging Face API to make requests to a model, and the API will send us the model's response. Since these requests happen over the internet, you'll want to add something called a token to your Hugging Face Space. It's like a digital key that grants access to a service.
Create a Hugging Face Token
To create a token, follow the steps below:
- Click on your avatar in the top right corner of Hugging Face and select "Access Tokens" from the menu.
- Click the "Create new token" button.
- Choose "Write" for the token type and give your token a name: `HF_TOKEN`.
- Click the "Create token" button.
- Copy your token and save it somewhere safe! It will disappear after you close this window! Once you've saved the token, click "Done".
Using a token ensures API calls are authenticated and secure. 🔒
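On a Space, a secret you add in Settings becomes an environment variable, so your code can read the token without ever hard-coding it. The `InferenceClient` handles authentication for you, but this hypothetical helper shows the underlying pattern (the `auth_headers` name and error message are invented; the `Bearer` header format is how Hugging Face API tokens are sent):

```python
import os

def auth_headers():
    """Build request headers from the HF_TOKEN environment variable.
    On Hugging Face Spaces, secrets appear as environment variables,
    so the token never sits in your source code."""
    token = os.environ.get("HF_TOKEN")
    if token is None:
        raise RuntimeError("HF_TOKEN is not set; add it as a Space secret.")
    return {"Authorization": f"Bearer {token}"}
```

Keeping the token out of `app.py` matters because Space code is often public, while secrets stay private.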
Code-Along | Connect Your Chatbot to an LLM
Continue working with the same chatbot you built in the previous lesson. Follow these steps to update your chatbot to fetch responses from an LLM!
- To add your token to your Hugging Face Space, open the Settings by clicking on the three dots. Scroll down to "Variables and secrets" and click "New secret". Use "HF_TOKEN" for the name of your secret and paste your token in the value field.
- Next, we'll use the InferenceClient class from Hugging Face to connect to and use a model hosted on Hugging Face. Add this line at the top of `app.py` next to the other import statements:

```python
from huggingface_hub import InferenceClient
```
- Create an instance of `InferenceClient` connected to the "microsoft/phi-4" text-generation model. This client will handle making requests to the model to generate responses:

```python
client = InferenceClient("microsoft/phi-4")
```
- Within the `respond()` function, initialize a list of dictionaries to store the messages. Each dictionary includes `role` and `content` keys:

```python
messages = [{"role": "system", "content": "You are a friendly chatbot."}]
```
- Add a conditional statement to add all previous conversation messages to the messages list only if there is conversation history provided:

```python
if history:
    messages.extend(history)
```
- Then add the current user's message to the messages list:

```python
messages.append({"role": "user", "content": message})
```
- Now we can make the chat completion API call, sending the messages and other parameters to the model:

```python
response = client.chat_completion(
    messages,
    max_tokens=100
)
```
- Finally, we'll extract and return the chatbot's response:

```python
return response['choices'][0]['message']['content'].strip()
```
Your completed `app.py` file should look like this:

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("microsoft/phi-4")

def respond(message, history):
    messages = [{"role": "system", "content": "You are a friendly chatbot."}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": message})
    response = client.chat_completion(
        messages,
        max_tokens=100
    )
    return response['choices'][0]['message']['content'].strip()

chatbot = gr.ChatInterface(respond)
chatbot.launch()
```
Now that we've connected our chatbot to an LLM using the InferenceClient, let's explore how we can fine-tune its responses. Remember how transformers process context to generate probabilities for the next words? We can influence this generation process by adjusting parameters like `max_tokens` and `temperature`. These parameters directly impact how the model applies its attention mechanism and makes token predictions. Let's experiment with these settings to see how they affect our chatbot's responses.
Try-It | More Parameters
We passed in two parameters to `chat_completion()`, but there are more options! Try increasing or decreasing the value for the `max_tokens` parameter. Next, try passing in a decimal value between 0 and 2 for `temperature` or a value between 0 and 1 for `top_p`. These are hyperparameters that control how straightforward and predictable the response is. Read more about them here!
🌶️ Mild Challenge: You can also change which model your API call is using! We used "microsoft/phi-4", but there are other compatible models shared on the 🤗 Hub! Try one of the following options:
- google/gemma-2-2b-it (You'll need to request permission first, because this is a gated model!)
Note: You may run into errors when you're trying different models. To see the error messages, set `debug` to `True` in `launch()`. Read through them (often the last message is the most helpful) and see if you can correct the issue. You'll also notice that DeepSeek's model returns all of its reasoning before the response. Interesting!
Adjusting parameters gives us control over the quality and style of responses! Now let's look at another way to improve the chatbot. Users often prefer to see responses appear gradually rather than waiting for the complete answer. This incremental generation mimics how transformer models naturally work: predicting one token at a time using self-attention across the existing context. By implementing streaming, we can expose this step-by-step generation process to users, creating a more engaging experience that reveals how the model builds context as it generates text.
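To build intuition for what `temperature` does before you experiment, here is a sketch that rescales the story example's next-word distribution. The `apply_temperature` helper is a simplification: inside a real model, temperature divides the raw logits before the softmax, but the effect on the probabilities is the same:

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a next-word distribution by a temperature value.
    Low temperature sharpens the distribution (more predictable);
    high temperature flattens it (more varied)."""
    logits = {w: math.log(p) for w, p in probs.items()}
    scaled = {w: math.exp(l / temperature) for w, l in logits.items()}
    total = sum(scaled.values())
    return {w: v / total for w, v in scaled.items()}

probs = {"there": 0.65, "lived": 0.15, "in": 0.10, "a": 0.05}
cool = apply_temperature(probs, 0.5)   # "there" dominates even more
warm = apply_temperature(probs, 1.5)   # probabilities move closer together
```

At temperature 0.5, "there" climbs above 90% probability; at 1.5, its lead shrinks, which is why higher temperatures make responses feel more varied and surprising.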
Try-It | Stream the Response
Use a `for` loop and replace `return` with `yield` to stream the response.
- Initialize the `response` variable as an empty string.
- Pass `stream=True` to the `client.chat_completion` method.
- Write a `for` loop to iterate through each `message` returned by the `client.chat_completion` method.
- Within the code block of the `for` loop:
  - Capture the most recent token: `token = message.choices[0].delta.content`
  - Add it to the response: `response += token`
  - And `yield` the response: `yield response`
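The steps above follow Python's standard generator pattern. This toy version swaps the real API for a fake stream so you can see the shape of the code without a network call (`fake_stream`, the sample sentence, and the toy ignoring the user's message are invented stand-ins):

```python
def fake_stream(text):
    """Stand-in for client.chat_completion(..., stream=True):
    yields one token (here, one word) at a time."""
    for word in text.split():
        yield word + " "

def respond_streaming(message):
    """Accumulate tokens and yield the growing response after each one,
    mirroring the steps above. (This toy ignores the user's message.)"""
    response = ""
    for token in fake_stream("Once upon a time there was a dragon"):
        response += token
        yield response

chunks = list(respond_streaming("Tell me a story"))
# Each chunk is longer than the last; the final one is the full sentence.
```

Gradio re-renders the chat bubble on every `yield`, so the user watches the response grow token by token instead of waiting for the whole thing.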
Hint: You will no longer `return` the response!
Now let's look at another way to improve the quality of those responses. Remember how we discussed that transformers rely heavily on context? The system message provides crucial initial context that shapes how the model applies its self-attention mechanisms. Just as the encoder component of a transformer creates a context-aware representation of inputs, our system message creates a foundation for the model's understanding of its role. By crafting a more specific system message, we can guide the model's responses without the computational expense of fine-tuning the entire network.
Practice | Improve the System Message
It's time to give your chatbot a specific role and personality! Choose a specialty for your chatbot, then craft a system message.
- Choose a role for your chatbot (examples: fitness coach, study buddy, travel advisor, recipe assistant, book recommender, career counselor).
- Write a system message that defines your chatbot's role using at least one strategy:
- Give the model a "job" or "role"
- Limit the number of words
- Provide a one-shot example format
- Test your chatbot with 3-5 questions and refine your system message based on the responses.
Example System Message: "You are a patient homework helper for high school students. Break down complex concepts into simple steps using everyday analogies. Keep responses under 100 words. Guide students to solutions. Never give direct answers."
💼 Takeaways
In this lesson, we explored how generative AI works, focusing on the transformer model, which helps AI understand and predict text more effectively.
- The transformer architecture overcomes limitations of older sequential models by processing words in context, resulting in more coherent and contextually accurate outputs.
- Using APIs (e.g., Hugging Face's `InferenceClient`) allows seamless integration of LLMs into applications like chatbots, making advanced NLP accessible and practical.
- Adjusting system messages and fine-tuning pre-trained models are strategies for tailoring chatbot behavior and improving response quality without extensive re-training.
For a summary of this lesson, check out the 11. Generative AI One-Pager!