🎯 Learning Goals
- Define sentiment analysis and discuss its real-world applications (e.g., social media monitoring, customer feedback)
- Use Python libraries to perform basic text preprocessing tasks such as tokenization, stop word removal, and vectorization
- Build a logistic regression model to classify simple text statements as positive or negative
- Use accuracy to assess the performance of a sentiment analysis model
📚 Technical Vocabulary
- Natural Language Processing (NLP)
- Sentiment Analysis
- Tokenization
- Stop Words
- Preprocessing
- Vectors
- Logistic Regression
Warm-Up
Let's play a quick game of sentiment analysis! For each message, identify the sentiment as positive or negative.
- "I just had the best coffee ever! Totally made my day ☕️😄"
- "Missed my bus again and now I'm late to class. Mondays, am I right? 😩"
- "Just got my math test back and aced it! Feeling unstoppable today!"
- "Oh great, another school project due on Friday. Just what I needed to make my week even more exciting! 🙄"
What words, phrases, or emojis signaled the sentiment? How did punctuation or context influence your decision? How might you teach a machine to identify these same features?
As you can probably guess, teaching a machine to extract meaning from text is slightly more complicated than teaching a machine to predict if a student will pass or fail based on the number of hours they studied. In this lesson, we'll explore how we can teach a machine to understand the emotional tone in text, using Python to preprocess data and logistic regression to classify statements as positive or negative.
NLP & Sentiment Analysis
Natural Language Processing (NLP) is a field of computer science and artificial intelligence that focuses on teaching computers to understand human language. Today, we'll dive into the exciting field of NLP by learning how machines understand language through a hands-on exercise with sentiment analysis. The objective of sentiment analysis is to predict whether a message has a positive or negative sentiment.
As you might imagine, there are many different approaches to this task! Some methods use deep learning neural networks, others rely on pre-built dictionaries of positive and negative words, and some combine multiple techniques. Today, we'll use a supervised machine learning approach with feature extraction, a foundational method in sentiment analysis, to gain an understanding of how machines make sense of human language.
Let's take a look at an example:
"Machine learning is the best!"
It's clear to us that the sentiment is positive, but how do we teach a machine to recognize this? While this example is straightforward, real-world sentiment analysis faces many challenges. Sarcasm, mixed emotions, context-dependent meaning, and cultural differences all make this task more complex than it might initially appear.
To teach a machine to recognize sentiment, you'll follow these steps:
- Start with Labeled Data: Gather a set of messages where positive ones are labeled with a "1" and negative ones with a "0."
- Extract Features: Process these messages to pull out important clues, like key words or phrases, that help tell if a message is positive or negative.
- Train Your Model: Match these clues to the correct labels and adjust the model so its predictions are as accurate as possible.
- Classify New Messages: Once your model is trained, you can use it to decide if new messages, like the example above, have a positive or negative sentiment.
Creating a Frequency Dictionary
Today, we're going to explore one way to turn text into a vector, which is just a fancy way of saying we'll convert words into numbers so that a computer can understand them. First, you'll build a vocabulary, which is a list of all the unique words from a group of messages. Let's imagine this is our list of messages:
- "I'm so happy because I finally beat that level!"
- "I'm so happy!"
- "I'm sad."
- "I'm sad because I can't beat that level."
From these messages, we can build our vocabulary of only unique words. That might look something like this:
# vocabulary ["I'm", "so", "happy", "because", "I", ..., "can't"]
Notice the word "I'm" appears in all of the messages. Since the vocabulary is a list of unique words, any words that appear in multiple messages will not be repeated.
Next, we'll use this list of unique vocabulary words to create a frequency dictionary. For sentiment analysis, we work with two groups: positive messages and negative messages. You count how many times each word shows up in the positive messages and then in the negative messages. For example, if the word "happy" appears once in one positive message and once in another, its positive count is 2. You record all these counts in a table that maps each word and its class (positive or negative) to the number of times it appears.
| Vocabulary | Positive | Negative |
| --- | --- | --- |
| I'm | 2 | 2 |
| so | 2 | 0 |
| happy | 2 | 0 |
| because | 1 | 1 |
| I | 1 | 1 |
| finally | 1 | 0 |
| beat | 1 | 1 |
| that | 1 | 1 |
| level | 1 | 1 |
| sad | 0 | 2 |
| can't | 0 | 1 |
This table is your frequency table. It tells you, for every word, how often it appears in each type of message. In practice, this table is a dictionary that maps each (word, class) pair to its frequency.
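The frequency table above can be built in just a few lines of Python. The sketch below uses illustrative names (`messages`, `build_freqs`) and deliberately crude punctuation stripping; it is a toy example, not the lab's implementation:

```python
from collections import defaultdict

# Labeled messages: 1 = positive, 0 = negative
messages = [
    ("I'm so happy because I finally beat that level!", 1),
    ("I'm so happy!", 1),
    ("I'm sad.", 0),
    ("I'm sad because I can't beat that level.", 0),
]

def build_freqs(labeled_messages):
    """Map each (word, class) pair to how often it appears."""
    freqs = defaultdict(int)
    for text, label in labeled_messages:
        # Crude punctuation stripping, just enough for this toy example
        for word in text.replace("!", "").replace(".", "").split():
            freqs[(word, label)] += 1
    return freqs

freqs = build_freqs(messages)
print(freqs[("happy", 1)])  # 2
print(freqs[("sad", 0)])    # 2
```

Querying `freqs[("happy", 1)]` answers "how many times does 'happy' appear in positive messages?", which matches the Positive column of the table.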
Representing a Message as a Vector
Once you have your frequency dictionary, you can represent a message as a simple vector with three numbers:
- Bias Unit: Always a 1.
- Positive Frequency Sum: The sum of the counts for all the words in your message that appear in positive messages.
- Negative Frequency Sum: The sum of the counts for all the words in your message that appear in negative messages.
For example, if a messageâs words have a total positive frequency of 8 and a negative frequency of 11, then that message is represented by the vector [1, 8, 11].
Based on our frequency table, which is obviously very limited, what would be the vector for this message? "I'm sad because I can't go to the movies with my friends."
Click to see the solution!
[1, 4, 7]
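One way this lookup might be coded is sketched below. The `freqs` dictionary is hand-typed from the frequency table above, and `extract_features` is an illustrative name; words not in the vocabulary (like "movies") simply contribute 0 to both sums:

```python
# Frequency dictionary transcribed from the table: (word, class) -> count
freqs = {
    ("I'm", 1): 2, ("I'm", 0): 2, ("so", 1): 2, ("happy", 1): 2,
    ("because", 1): 1, ("because", 0): 1, ("I", 1): 1, ("I", 0): 1,
    ("finally", 1): 1, ("beat", 1): 1, ("beat", 0): 1,
    ("that", 1): 1, ("that", 0): 1, ("level", 1): 1, ("level", 0): 1,
    ("sad", 0): 2, ("can't", 0): 1,
}

def extract_features(message, freqs):
    """Represent a message as [bias, positive sum, negative sum]."""
    words = message.replace("!", "").replace(".", "").split()
    pos = sum(freqs.get((word, 1), 0) for word in words)
    neg = sum(freqs.get((word, 0), 0) for word in words)
    return [1, pos, neg]

print(extract_features("I'm sad because I can't go to the movies with my friends.", freqs))
# [1, 4, 7]
```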
Preprocessing Messages
Most of the time when you begin processing messages, they might not be in the best format for analysis. For example, comments on TikTok often include the username at the beginning of the comment.
Before counting words, you need to clean up your comments! This process is called preprocessing, and it involves two major steps: removing stop words and stemming.
- Stop Words: These are common words like "and," "are," "a," and "at" that don't add much meaning. You also remove punctuation, URLs, and usernames because they usually don't help determine sentiment.
- Stemming: This means reducing words to their base form. For example, "love," "loving," and "loved" all become "lov." You'll also convert all words to lowercase so that "OMGGG," "Omggg," and "omggg" are treated as the same word.
After this process, each comment only contains the words that contribute to its overall sentiment.
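A toy version of these preprocessing steps might look like the sketch below. The tiny stop-word list and the `simple_stem` function are stand-ins invented for illustration; real projects typically use a library like NLTK, which provides a full stop-word list and the Porter stemmer:

```python
import re

STOP_WORDS = {"and", "are", "a", "at", "the", "is", "i", "this", "to"}  # toy list

def simple_stem(word):
    # Very crude suffix stripping, just to show the idea of stemming
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(comment):
    comment = re.sub(r"https?://\S+", "", comment)  # drop URLs
    comment = re.sub(r"@\w+", "", comment)          # drop usernames
    comment = re.sub(r"[^\w\s']", "", comment)      # drop punctuation
    tokens = comment.lower().split()
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("@user123 Loving this video!!!"))  # ['lov', 'video']
```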
Building the Feature Matrix
Now that youâve preprocessed your messages, you can represent each one as a 3-number vector. When you stack all these vectors together, you get a matrix where each row is a message and the columns are the three features (bias, positive frequency sum, negative frequency sum). This matrix is what you will feed into your logistic regression classifier!
An Overview of Logistic Regression
Next, let's look at how logistic regression uses these features to predict sentiment. Logistic regression is a model that takes your feature matrix and predicts whether a message is positive or negative. It uses a special function called the sigmoid function, which outputs a probability between 0 and 1. If the probability is above 0.5, the message is classified as positive; if it's below 0.5, it's negative.
To make accurate predictions, the model needs to learn the best parameters (called theta). It does this by comparing its predictions to the true labels of your messages and then adjusting theta to reduce any errors. This adjustment process is done using an algorithm called gradient descent, which repeats the process until the error is minimized.
Training Your Model
To train your logistic regression model, you start by initializing theta with some values. Then, using gradient descent, you update theta based on the difference between the predicted and actual labels. After many iterations, theta converges to values that minimize the cost (error) of the model. Once training is complete, you can use your trained model to predict the sentiment of new messages!
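The whole training loop fits in a short sketch. The feature values below are made-up toy numbers (not from the lab dataset), and the learning rate and iteration count are arbitrary choices for this example:

```python
import numpy as np

def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

def train(X, y, alpha=0.05, iters=5000):
    """Batch gradient descent for logistic regression.
    X: (m, 3) feature matrix of [bias, pos_sum, neg_sum] rows; y: (m,) labels."""
    m, n = X.shape
    theta = np.zeros(n)                  # start with all-zero parameters
    for _ in range(iters):
        h = sigmoid(X @ theta)           # predicted probabilities
        gradient = X.T @ (h - y) / m     # slope of the log-loss cost
        theta -= alpha * gradient        # step downhill to reduce error
    return theta

# Toy data: positive messages have bigger positive-frequency sums
X = np.array([[1, 8, 2], [1, 7, 1], [1, 2, 9], [1, 1, 8]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

theta = train(X, y)
preds = sigmoid(X @ theta) > 0.5
print(preds)  # should match y: [ True  True False False]
```

Notice that after training, the weight on the positive-frequency sum comes out positive and the weight on the negative-frequency sum comes out negative, which matches the intuition behind the features.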
Lab | Sentiment Analysis
Ready to train your first NLP model? In this lab, we'll analyze comments on YouTube animal videos to see if we can teach a computer to understand sentiment!
Your Tasks:
- Open this Colab notebook and Save a Copy to your Drive
- Read the notes and finish the Try-It challenges
- (Optional) Check out the source dataset here
Limitations of the âBag of Wordsâ Approach
While the logistic regression model you built is a powerful entry point to sentiment analysis, it relies on a Bag of Words approach that has significant limitations. This method treats a message as a simple collection of individual components, like a list of ingredients, rather than a structured "recipe" where the order and combination of those components impact the meaning!
Key Challenges in Frequency-Based Analysis
- Failure to Detect Sarcasm: Simple models like this one often struggle with tone and context. For example, a word-frequency model may classify "Great, another error" as positive simply because it focuses on the word "great," missing the sarcastic negative sentiment.
- Disregard for Word Order: Frequency-based models cannot distinguish between sentences with identical words but opposite meanings! Both "The food was good, not bad" and "The food was bad, not good" contain the same "ingredients," leading the model to assign them identical sentiment scores!
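You can see the word-order problem directly in code. Here Python's `Counter` serves as a quick stand-in for a bag-of-words representation: both sentences produce identical word counts, so any model that only sees the counts must score them identically:

```python
from collections import Counter

good = Counter("the food was good not bad".split())
bad = Counter("the food was bad not good".split())

print(good == bad)  # True: identical bags of words, opposite meanings
```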
To overcome these limitations, the field of NLP has shifted toward Sequence Modeling. Unlike logistic regression and the Bag of Words approach, these models read text in order, allowing the AI to understand how a word's meaning is modified by its surrounding context.
This evolution led to Transformers, the technology behind ChatGPT. Transformers use "attention" to weigh the importance of different words in a sentence, helping the machine grasp sarcasm and complex context. In the upcoming lessons, we'll explore how these architectures move beyond counting words to truly understand language!
🤖 AI Connection
The difference between the bag of words approach and contextual models like Transformers can be tricky to wrap your head around. Ask an AI tool: "Can you give me an analogy that compares the bag of words approach to sentiment analysis with how contextual models like Transformers understand language?" Does the analogy help clarify the difference? If not, tell the AI what's still confusing and ask it to try a different one. Remember, you can always follow up!
💼 Takeaways
In this lesson, we learned how to train a model to analyze sentiment by converting text into numerical data and using logistic regression to classify messages as positive or negative.
- Sentiment analysis helps determine if a message is positive or negative using machine learning.
- Preprocessing includes removing stop words, punctuation, usernames, and URLs in addition to reducing words to their base form and lowercasing all words.
- Messages are converted into numerical vectors to be passed to the logistic regression classifier model.
- The model learns from a training set with labeled messages and adjusts parameters (theta) using gradient descent to minimize errors and improve accuracy.
For a summary of this lesson, check out the 8. Sentiment Analysis Lab One-Pager!