🎯 Learning Goals
- Define sentiment analysis and discuss its real-world applications (e.g., social media monitoring, customer feedback)
- Use Python libraries to perform basic text preprocessing tasks such as tokenization, stop word removal, and vectorization
- Build a logistic regression model to classify simple text statements as positive or negative
- Use accuracy to assess the performance of a sentiment analysis model
📚 Technical Vocabulary
- Natural Language Processing (NLP)
- Sentiment Analysis
- Tokenization
- Stop Words
- Preprocessing
- Vectors
- Logistic Regression
Warm-Up
Let's play a quick game of sentiment analysis! For each message, identify the sentiment as positive or negative.
- "I just had the best coffee ever! Totally made my day ☕️😄"
- "Missed my bus again and now I'm late to class. Mondays, am I right? 😩"
- "Just got my math test back and aced it! Feeling unstoppable today!"
- "Oh great, another school project due on Friday. Just what I needed to make my week even more exciting! 🙄"
What words, phrases, or emojis signaled the sentiment? How did punctuation or context influence your decision? How might you teach a machine to identify these same features?
As you can probably guess, teaching a machine to extract meaning from text is slightly more complicated than teaching a machine to predict if a student will pass or fail based on the number of hours they studied. In this lesson, we'll explore how we can teach a machine to understand the emotional tone in text, using Python to preprocess data and logistic regression to classify statements as positive or negative.
NLP & Sentiment Analysis
Natural Language Processing (NLP) is a field of computer science and artificial intelligence that focuses on teaching computers to understand human language. Today, we'll dive into the exciting field of NLP by learning how machines understand language through a hands-on exercise with sentiment analysis. The objective of sentiment analysis is to predict whether a message has a positive or negative sentiment.
As you might imagine, there are many different approaches to this task! Some methods use deep learning neural networks, others rely on pre-built dictionaries of positive and negative words, and some combine multiple techniques. Today, we'll use a supervised machine learning approach with feature extraction, a foundational method in sentiment analysis, to gain an understanding of how machines make sense of human language.
Let's take a look at an example:
"Machine learning is the best!"
It's clear to us that the sentiment is positive, but how do we teach a machine to recognize this? While this example is straightforward, real-world sentiment analysis faces many challenges. Sarcasm, mixed emotions, context-dependent meaning, and cultural differences all make this task more complex than it might initially appear.
To teach a machine to recognize sentiment, you'll follow these steps:
- Start with Labeled Data: Gather a set of messages where positive ones are labeled with a "1" and negative ones with a "0."
- Extract Features: Process these messages to pull out important clues, like key words or phrases, that help tell if a message is positive or negative.
- Train Your Model: Match these clues to the correct labels and adjust the model so its predictions are as accurate as possible.
- Classify New Messages: Once your model is trained, you can use it to decide if new messages, like the example above, have a positive or negative sentiment.
Creating a Frequency Dictionary
Today, we're going to explore one way to turn text into a vector, which is just a fancy way of saying we'll convert words into numbers so that a computer can understand them. First, you'll build a vocabulary, which is a list of all the unique words from a group of messages. Let's imagine this is our list of messages:
- "I'm so happy because I finally beat that level!"
- "I'm so happy!"
- "I'm sad."
- "I'm sad because I can't beat that level."
From these messages, we can build our vocabulary of only unique words. That might look something like this:
# vocabulary ["I'm", "so", "happy", "because", "I", ..., "can't"]
Notice the word "I'm" appears in all of the messages. Since the vocabulary is a list of unique words, any words that appear in multiple messages will not be repeated.
Next, we'll use this list of unique vocabulary words to create a frequency dictionary. For sentiment analysis, we work with two groups: positive messages and negative messages. You count how many times each word shows up in the positive messages and then in the negative messages. For example, if the word "happy" appears once in one positive message and once in another, its positive count is 2. You record all these counts in a table that maps each word and its class (positive or negative) to the number of times it appears.
| Vocabulary | Positive | Negative |
| --- | --- | --- |
| I'm | 2 | 2 |
| so | 2 | 0 |
| happy | 2 | 0 |
| because | 1 | 1 |
| I | 1 | 1 |
| finally | 1 | 0 |
| beat | 1 | 1 |
| that | 1 | 1 |
| level | 1 | 1 |
| sad | 0 | 2 |
| can't | 0 | 1 |
This table is your frequency table. It tells you, for every word, how often it appears in each type of message. In practice, this table is a dictionary that maps each (word, class) pair to its frequency.
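The frequency table above can be built in just a few lines of Python. The sketch below uses illustrative names (`messages`, `build_freqs`) and deliberately crude punctuation stripping; it is a toy example, not the lab's implementation:

```python
from collections import defaultdict

# Labeled messages: 1 = positive, 0 = negative
messages = [
    ("I'm so happy because I finally beat that level!", 1),
    ("I'm so happy!", 1),
    ("I'm sad.", 0),
    ("I'm sad because I can't beat that level.", 0),
]

def build_freqs(labeled_messages):
    """Map each (word, class) pair to how often it appears."""
    freqs = defaultdict(int)
    for text, label in labeled_messages:
        # Crude punctuation stripping, just enough for this toy example
        for word in text.replace("!", "").replace(".", "").split():
            freqs[(word, label)] += 1
    return freqs

freqs = build_freqs(messages)
print(freqs[("happy", 1)])  # 2
print(freqs[("sad", 0)])    # 2
```

Querying `freqs[("happy", 1)]` answers "how many times does 'happy' appear in positive messages?", which matches the Positive column of the table.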
Representing a Message as a Vector
Once you have your frequency dictionary, you can represent a message as a simple vector with three numbers:
- Bias Unit: Always a 1.
- Positive Frequency Sum: The sum of the counts for all the words in your message that appear in positive messages.
- Negative Frequency Sum: The sum of the counts for all the words in your message that appear in negative messages.
For example, if a messageâs words have a total positive frequency of 8 and a negative frequency of 11, then that message is represented by the vector [1, 8, 11].
Based on our frequency table, which is obviously very limited, what would be the vector for this message? "I'm sad because I can't go to the movies with my friends."
Click to see the solution!
[1, 4, 7]
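One way this lookup might be coded is sketched below. The `freqs` dictionary is hand-typed from the frequency table above, and `extract_features` is an illustrative name; words not in the vocabulary (like "movies") simply contribute 0 to both sums:

```python
# Frequency dictionary transcribed from the table: (word, class) -> count
freqs = {
    ("I'm", 1): 2, ("I'm", 0): 2, ("so", 1): 2, ("happy", 1): 2,
    ("because", 1): 1, ("because", 0): 1, ("I", 1): 1, ("I", 0): 1,
    ("finally", 1): 1, ("beat", 1): 1, ("beat", 0): 1,
    ("that", 1): 1, ("that", 0): 1, ("level", 1): 1, ("level", 0): 1,
    ("sad", 0): 2, ("can't", 0): 1,
}

def extract_features(message, freqs):
    """Represent a message as [bias, positive sum, negative sum]."""
    words = message.replace("!", "").replace(".", "").split()
    pos = sum(freqs.get((word, 1), 0) for word in words)
    neg = sum(freqs.get((word, 0), 0) for word in words)
    return [1, pos, neg]

print(extract_features("I'm sad because I can't go to the movies with my friends.", freqs))
# [1, 4, 7]
```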
Preprocessing Messages
Most of the time when you begin processing messages, they might not be in the best format for analysis. For example, comments on TikTok often include the username at the beginning of the comment.
Before counting words, you need to clean up your comments! This process is called preprocessing, and it involves two major steps: removing stop words and stemming.
- Stop Words: These are common words like "and," "are," "a," and "at" that don't add much meaning. You also remove punctuation, URLs, and usernames because they usually don't help determine sentiment.
- Stemming: This means reducing words to their base form. For example, "love," "loving," and "loved" all become "lov." You'll also convert all words to lowercase so that "OMGGG," "Omggg," and "omggg" are treated as the same word.
After this process, each comment only contains the words that contribute to its overall sentiment.
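A toy version of these preprocessing steps might look like the sketch below. The tiny stop-word list and the `simple_stem` function are stand-ins invented for illustration; real projects typically use a library like NLTK, which provides a full stop-word list and the Porter stemmer:

```python
import re

STOP_WORDS = {"and", "are", "a", "at", "the", "is", "i", "this", "to"}  # toy list

def simple_stem(word):
    # Very crude suffix stripping, just to show the idea of stemming
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(comment):
    comment = re.sub(r"https?://\S+", "", comment)  # drop URLs
    comment = re.sub(r"@\w+", "", comment)          # drop usernames
    comment = re.sub(r"[^\w\s']", "", comment)      # drop punctuation
    tokens = comment.lower().split()
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("@user123 Loving this video!!!"))  # ['lov', 'video']
```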
Building the Feature Matrix
Now that youâve preprocessed your messages, you can represent each one as a 3-number vector. When you stack all these vectors together, you get a matrix where each row is a message and the columns are the three features (bias, positive frequency sum, negative frequency sum). This matrix is what you will feed into your logistic regression classifier!
An Overview of Logistic Regression
Next, let's look at how logistic regression uses these features to predict sentiment. Logistic regression is a model that takes your feature matrix and predicts whether a message is positive or negative. It uses a special function called the sigmoid function, which outputs a probability between 0 and 1. If the probability is above 0.5, the message is classified as positive; if it's below 0.5, it's negative.
To make accurate predictions, the model needs to learn the best parameters (called theta). It does this by comparing its predictions to the true labels of your messages and then adjusting theta to reduce any errors. This adjustment process is done using an algorithm called gradient descent, which repeats the process until the error is minimized.
Training Your Model
To train your logistic regression model, you start by initializing theta with some values. Then, using gradient descent, you update theta based on the difference between the predicted and actual labels. After many iterations, theta converges to values that minimize the cost (error) of the model. Once training is complete, you can use your trained model to predict the sentiment of new messages!
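The whole training loop fits in a short sketch. The feature values below are made-up toy numbers (not from the lab dataset), and the learning rate and iteration count are arbitrary choices for this example:

```python
import numpy as np

def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

def train(X, y, alpha=0.05, iters=5000):
    """Batch gradient descent for logistic regression.
    X: (m, 3) feature matrix of [bias, pos_sum, neg_sum] rows; y: (m,) labels."""
    m, n = X.shape
    theta = np.zeros(n)                  # start with all-zero parameters
    for _ in range(iters):
        h = sigmoid(X @ theta)           # predicted probabilities
        gradient = X.T @ (h - y) / m     # slope of the log-loss cost
        theta -= alpha * gradient        # step downhill to reduce error
    return theta

# Toy data: positive messages have bigger positive-frequency sums
X = np.array([[1, 8, 2], [1, 7, 1], [1, 2, 9], [1, 1, 8]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

theta = train(X, y)
preds = sigmoid(X @ theta) > 0.5
print(preds)  # should match y: [ True  True False False]
```

Notice that after training, the weight on the positive-frequency sum comes out positive and the weight on the negative-frequency sum comes out negative, which matches the intuition behind the features.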
Lab | Sentiment Analysis
Ready to train your first NLP model? In this lab, we'll analyze comments on YouTube animal videos to see if we can teach a computer to understand sentiment!
Your Tasks:
- Open this Colab notebook and Save a Copy to your Drive
- Read the notes and finish the Try-It challenges
- (Optional) Check out the source dataset here
Limitations of the âBag of Wordsâ Approach
While the logistic regression model you built is a powerful entry point to sentiment analysis, it relies on a Bag of Words approach that has significant limitations. This method treats a message as a simple collection of individual components, like a list of ingredients, rather than a structured "recipe" where the order and combination of those components impact the meaning!
Key Challenges in Frequency-Based Analysis
- Failure to Detect Sarcasm: Simple models like this one often struggle with tone and context. For example, a word-frequency model may classify "Great, another error" as positive simply because it focuses on the word "great," missing the sarcastic negative sentiment.
- Disregard for Word Order: Frequency-based models cannot distinguish between sentences with identical words but opposite meanings! Both "The food was good, not bad" and "The food was bad, not good" contain the same "ingredients," leading the model to assign them identical sentiment scores!
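You can see the word-order problem directly in code. Here Python's `Counter` serves as a quick stand-in for a bag-of-words representation: both sentences produce identical word counts, so any model that only sees the counts must score them identically:

```python
from collections import Counter

good = Counter("the food was good not bad".split())
bad = Counter("the food was bad not good".split())

print(good == bad)  # True: identical bags of words, opposite meanings
```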
To overcome these limitations, the field of NLP has shifted toward Sequence Modeling. Unlike logistic regression and the Bag of Words approach, these models read text in order, allowing the AI to understand how a word's meaning is modified by its surrounding context.
This evolution led to Transformers, the technology behind ChatGPT. Transformers use "attention" to weigh the importance of different words in a sentence, helping the machine grasp sarcasm and complex context. In the upcoming lessons, we'll explore how these architectures move beyond counting words to truly understand language!
🤖 AI Connection
The difference between the bag of words approach and contextual models like Transformers can be tricky to wrap your head around. Ask an AI tool: "Can you give me an analogy that compares the bag of words approach to sentiment analysis with how contextual models like Transformers understand language?" Does the analogy help clarify the difference? If not, tell the AI what's still confusing and ask it to try a different one. Remember, you can always follow up!
💼 Takeaways
In this lesson, we learned how to train a model to analyze sentiment by converting text into numerical data and using logistic regression to classify messages as positive or negative.
- Sentiment analysis helps determine if a message is positive or negative using machine learning.
- Preprocessing includes removing stop words, punctuation, usernames, and URLs in addition to reducing words to their base form and lowercasing all words.
- Messages are converted into numerical vectors to be passed to the logistic regression classifier model.
- The model learns from a training set with labeled messages and adjusts parameters (theta) using gradient descent to minimize errors and improve accuracy.
For a summary of this lesson, check out the 8. Sentiment Analysis Lab One-Pager!