A few notes about NLP. I will update this page when I have time.
Supervised ML
In this part, we will implement logistic regression, a supervised learning method, for sentiment analysis.
The general goal of supervised learning is to learn a function $f$ that maps inputs $\mathbf{x}$ to outputs $y$, using a dataset of input-output pairs $ \{ (\mathbf{x}_i, y_i) \}_{i=1}^{n} $. We look for the function $f$ that best fits the data:
$$ \hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f(\mathbf{x}_i), y_i) + \lambda \mathcal{R}(f) $$
where
- $\hat{f}$: The function learned from training data (your model).
- $\arg\min_{f \in \mathcal{F}}$: We are looking for the function $f$ in the hypothesis space $\mathcal{F}$ that minimizes the total cost.
- $\mathcal{F}$: The hypothesis space (set of all candidate models).
- $\mathcal{L}(f(\mathbf{x}_i), y_i)$: The loss function measuring the error between prediction and true label.
- $\frac{1}{n} \sum_{i=1}^{n}$: The empirical average over the training dataset.
- $\lambda \mathcal{R}(f)$: The regularization term, used to control model complexity and avoid overfitting.
- $\lambda$: A hyperparameter controlling the weight of regularization.
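To make the objective concrete, here is a minimal sketch that evaluates the regularized empirical risk for one candidate model. The linear model, the squared loss, the L2 penalty, and the names `empirical_risk`, `weights`, and `lam` are illustrative assumptions, not something fixed by the formula above.

```python
def empirical_risk(weights, bias, xs, ys, lam):
    """Average squared loss of a linear model plus an L2 penalty."""
    total_loss = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
        total_loss += (prediction - y) ** 2        # L(f(x_i), y_i)
    data_term = total_loss / len(xs)               # (1/n) * sum of losses
    penalty = lam * sum(w * w for w in weights)    # lambda * R(f)
    return data_term + penalty


# Toy dataset generated by y = 1*x1 + 2*x2 (made up for illustration).
xs = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
ys = [5.0, 2.0, 2.0]
risk_perfect = empirical_risk([1.0, 2.0], 0.0, xs, ys, lam=0.1)
risk_zero = empirical_risk([0.0, 0.0], 0.0, xs, ys, lam=0.1)
print(risk_perfect, risk_zero)
```

The model that fits the data perfectly still pays the penalty term, which is how $\lambda$ trades data fit against model complexity.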
Sentiment Analysis
Sentiment analysis is a common task in NLP, where the goal is to classify text into different sentiment categories (e.g., positive, negative, neutral). In this example, we will use logistic regression to perform sentiment analysis on a dataset of tweets.
In sentiment analysis, our input $x$ is a phrase and our output $y$ is a sentiment label (positive or negative).
Logistic Regression
Logistic regression estimates the probability that an instance belongs to a particular class. It uses the sigmoid function to map a real-valued score $z$ to a probability in the interval $(0, 1)$. The logistic function is defined as: $$ P(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \mathbf{w}^\top \mathbf{x} + b $$
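The formula can be sketched in a few lines of plain Python. The helper names `sigmoid` and `predict_proba`, and the example weights, are assumptions made for illustration. The two-branch form is one common way to avoid overflow in `exp` for large negative $z$.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)          # safe because z < 0
    return exp_z / (1.0 + exp_z)


def predict_proba(weights, bias, x):
    """P(y = 1 | x) for a single feature vector x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


print(sigmoid(0.0))                                   # 0.5
print(predict_proba([2.0, -1.0], 0.5, [1.0, 2.5]))    # z = 0, so 0.5
```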
Modeling
We have a list of tweets $$ [tweet_1, tweet_2, tweet_3, \dots] $$ and a vocabulary $$ V = [I, am, happy, sad, learn, machine, learning] $$
We then convert each tweet into a vector of size $|V|$ (the size of the vocabulary). For example, if $tweet_1$ is “I am happy”, we convert it into a vector of size 7: $$ [1, 1, 1, 0, 0, 0, 0] $$ This is a sparse representation, and it becomes problematic when the vocabulary is large. Logistic regression then has to learn $|V|+1$ parameters (one for each word in the vocabulary and one for the bias), which leads to longer training and inference times.
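The conversion above can be sketched as follows. The function name `tweet_to_vector` and the whitespace split are simplifying assumptions; a real pipeline would use the preprocessing described in the next section.

```python
vocabulary = ["I", "am", "happy", "sad", "learn", "machine", "learning"]


def tweet_to_vector(tweet, vocabulary):
    """Return a 0/1 vector with one entry per vocabulary word."""
    words = set(tweet.split())
    return [1 if word in words else 0 for word in vocabulary]


vector = tweet_to_vector("I am happy", vocabulary)
print(vector)   # [1, 1, 1, 0, 0, 0, 0]
```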
Preprocessing
As a simple example, we preprocess the sample Twitter dataset from the NLTK library.
import nltk # Python library for NLP
from nltk.corpus import twitter_samples # sample Twitter dataset from NLTK
# downloads sample twitter dataset.
nltk.download('twitter_samples')
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
tweet = all_positive_tweets[2277]
print(tweet)
# My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
The preprocessing techniques can vary greatly based on the specific requirements of the task and objectives. However, the following steps are commonly employed to enhance the quality and consistency of text data:
Removing unrelated elements:
Strip elements such as URLs, retweet markers, and hashtag signs. Mentions are removed later by the tokenizer (`strip_handles=True`).
import re
# remove old style retweet text "RT"
tweet2 = re.sub(r'^RT[\s]+', '', tweet)
# remove hyperlinks
tweet2 = re.sub(r'https?://[^\s\n\r]+', '', tweet2)
# remove hashtags
# only removing the hash # sign from the word
tweet2 = re.sub(r'#', '', tweet2)
# My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…
Lowercasing:
Convert all text to lowercase to ensure uniformity. This helps in treating words like “He” and “he” as the same token during analysis.
Tokenization:
Break down the text into individual words or tokens. This step is vital for processing as it allows further operations to be performed on these individual units.
from nltk.tokenize import TweetTokenizer # module for tokenizing tweets
# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)
# tokenize tweets
tweet_tokens = tokenizer.tokenize(tweet2)
# ['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
Removing Stop Words:
Eliminate common words that contribute little to overall meaning, such as conjunctions (‘and’), prepositions (‘in’), and articles (‘the’). Adjust the list of stop words based on domain-specific needs to retain meaningful elements that might be important for the context.
Removing Punctuation:
Strip punctuation marks like commas, periods, and exclamation points, as they generally do not carry semantic value. Be mindful, as punctuation can sometimes alter the meaning in certain contexts.
import string # for string operations
import nltk
from nltk.corpus import stopwords # stop word lists shipped with NLTK
nltk.download('stopwords') # download the stop word lists
stopwords_english = stopwords.words('english')
tweets_clean = [] # start with an empty list of kept tokens
for word in tweet_tokens: # go through every word in the tokens list
    if (word not in stopwords_english and # remove stopwords
            word not in string.punctuation): # remove punctuation
        tweets_clean.append(word)
print('removed stop words and punctuation:')
print(tweets_clean)
# ['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
Stemming/Lemmatization:
Transform words to their base or root forms. Stemming involves cutting off affixes (e.g., “running” to “run”), while lemmatization considers the context and converts words to their dictionary form (e.g., “better” to “good”). Choosing between stemming and lemmatization depends on the task; lemmatization is generally more accurate but computationally heavier.
from nltk.stem import PorterStemmer # module for stemming
stemmer = PorterStemmer()
# Create an empty list to store the stems
tweets_stem = []
for word in tweets_clean:
    stem_word = stemmer.stem(word) # stemming word
    tweets_stem.append(stem_word) # append to the list
Handling Numerical Values: Decide how to process numbers in the text. You may choose to remove, retain, or convert them depending on the goals, e.g., replacing digits with their word equivalents.
Normalizing Text: Address variations like spelling differences and expand contractions (e.g., “don’t” to “do not”) to bring consistency to the text.
Dealing with Special Characters: Remove or replace special characters depending on their relevance to the task at hand.
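As a rough illustration of the normalization steps above, the sketch below expands a few contractions and replaces digit runs with a placeholder. The `CONTRACTIONS` table, the `<num>` token, and the function name `normalize_text` are hypothetical choices for this example only; a real project would need a much larger contraction list and a number policy that fits the task.

```python
import re

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am", "it's": "it is"}


def normalize_text(text):
    """Lowercase, expand known contractions, and replace digit runs with <num>."""
    text = text.lower().replace("’", "'")     # unify curly and straight apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"\d+", "<num>", text)      # one possible way to handle numbers
    return text


normalized = normalize_text("I don’t have 20 dollars")
print(normalized)   # i do not have <num> dollars
```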
Here is a simple version of a tweet processing function:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets (lowercases, strips @handles, shortens repeated characters)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and # remove stopwords
                word not in string.punctuation): # remove punctuation
            stem_word = stemmer.stem(word) # stemming word
            tweets_clean.append(stem_word)
    return tweets_clean
Frequency representation
Use the frequency of each word in the negative and positive classes to create a feature vector for each tweet. This can help reduce the dimensionality of the input space and improve the performance of the model. $$ X_m = [1, \sum_{w}freqs(w, 1), \sum_{w}freqs(w, 0)] $$ This is the vector [bias, sum of positive frequencies, sum of negative frequencies], where the sums run over the words $w$ of tweet $m$. So we reduce the dimension to 3.
import numpy as np

def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
            frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()
    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs
Using build_freqs, we can create the representation $X_m$ for each tweet $m$.
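A minimal sketch of that feature extraction is shown below. The function name `extract_features` and the toy dictionary are assumptions for illustration. It takes an already processed list of words so that it runs without NLTK, and it counts a repeated word once per occurrence, which is one possible reading of the sum over $w$.

```python
def extract_features(words, freqs):
    """Map a processed tweet (list of words) to [bias, positive sum, negative sum]."""
    features = [1.0, 0.0, 0.0]                      # the bias term is always 1
    for word in words:
        features[1] += freqs.get((word, 1), 0)      # count in positive tweets
        features[2] += freqs.get((word, 0), 0)      # count in negative tweets
    return features


# Toy frequency dictionary in the same (word, label) -> count format.
toy_freqs = {("happi", 1): 3, ("happi", 0): 1, ("sad", 0): 4, ("learn", 1): 2}
features = extract_features(["happi", "learn", "nlp"], toy_freqs)
print(features)   # [1.0, 5.0, 1.0]
```

Words that never appear in the dictionary, such as "nlp" here, simply contribute zero to both sums.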
Cost function
The cost function measures how well the model is performing. Based on the cost function, we can update the parameters of the model to minimize the cost. Cross-entropy loss is commonly used for classification tasks, including logistic regression. It quantifies the difference between the predicted probabilities and the actual labels. The more $q$ diverges from $p$ (especially if it assigns low probability to the true class), the higher the cross-entropy. Cross-entropy is minimized when the predicted probabilities match the true distribution of the labels. $$ H(p, q) = -\sum_{x} p(x) \log(q(x)) $$ where
- $p(x)$ is the true distribution of the data (the actual labels)
- $q(x)$ is the predicted distribution (the model’s output probabilities)
In our case, we have two classes (positive and negative), so the cross-entropy loss can be expressed as: $$ J(\theta) = H(p, q) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$ where:
- $m$ is the number of training examples
- $y_i$ is the true label for the $i$-th example
- $\hat{y}_i$ is the predicted probability for the $i$-th example
And we have $$ \hat{y}_i = h(x_i, \theta) = \sigma(\theta^T x_i) $$ where $h$ is the logistic function.
Figure: cross-entropy loss function visualization.
When the true label is 1 (positive), only the first term of the sum, $-\log(\hat{y}_i)$, is non-zero; when the true label is 0 (negative), only the second term, $-\log(1 - \hat{y}_i)$, remains. In both cases the loss approaches 0 as the predicted probability approaches the true label, and it grows without bound as the prediction approaches the opposite label.
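The behaviour of the two terms can be checked numerically with a short sketch. The names `example_loss` and `cross_entropy`, and the clipping constant `eps` that keeps `log` away from zero, are assumptions added for this example.

```python
import math


def example_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy of one example; eps keeps log() away from 0."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


def cross_entropy(labels, predictions):
    """Average loss J over all examples."""
    losses = [example_loss(y, p) for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)


confident_right = example_loss(1, 0.9)   # about 0.105
confident_wrong = example_loss(1, 0.1)   # about 2.303
print(confident_right, confident_wrong)
print(cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```

A confident wrong prediction is penalized far more heavily than a confident right one is rewarded.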
Cost function and statistics
In the sentiment analysis problem, we can model the probability that our model assigns to the correct label as: $$ P(y_i|x_i, \theta) = h(x_i, \theta)^{y_i} (1- h(x_i, \theta))^{1-y_i} $$ Assuming the examples are independent, the likelihood of the whole dataset is: $$ L(\theta) = \prod_{i=1}^{m} P(y_i|x_i, \theta) = \prod_{i=1}^{m} h(x_i, \theta)^{y_i} (1- h(x_i, \theta))^{1-y_i} $$
As the number of examples grows, this product of values in $(0, 1)$ becomes extremely small and leads to numerical instability. Even a single very small factor can drive the whole product toward zero. So we use the log-likelihood instead: $$ \log L(\theta) = \sum_{i=1}^{m} \log P(y_i|x_i, \theta) = \sum_{i=1}^{m} \left[ y_i \log(h(x_i, \theta)) + (1-y_i) \log(1-h(x_i, \theta)) \right] $$ We then divide the log-likelihood by the number of examples to get the average log-likelihood: $$ \frac{1}{m} \log L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log P(y_i|x_i, \theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(h(x_i, \theta)) + (1-y_i) \log(1-h(x_i, \theta)) \right] $$
Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, so we use the average negative log-likelihood as our cost function: $$ J(\theta) = -\frac{1}{m} \log L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \log P(y_i|x_i, \theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(h(x_i, \theta)) + (1-y_i) \log(1-h(x_i, \theta)) \right] $$ This is exactly the cross-entropy loss from the previous section. Using the negative log-likelihood as the cost is a common practice in machine learning, especially for classification tasks.
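The numerical issue with the raw product is easy to reproduce. The sketch below uses 2000 made-up per-example probabilities, all equal to 0.5, purely as an assumption for illustration.

```python
import math

# 2000 hypothetical per-example probabilities P(y_i | x_i, theta), all equal to 0.5.
probabilities = [0.5] * 2000

likelihood = 1.0
for p in probabilities:
    likelihood *= p                      # underflows to exactly 0.0

log_likelihood = sum(math.log(p) for p in probabilities)
cost = -log_likelihood / len(probabilities)   # average negative log-likelihood

print(likelihood)        # 0.0
print(log_likelihood)    # about -1386.29
print(cost)              # about 0.693, i.e. log(2)
```

The product is no longer representable as a double-precision float, while the sum of logarithms and the averaged cost remain well-behaved numbers.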
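In the end, we have our hypothesis and cost function in vectorized form, where the bias is absorbed into $\theta$ through the constant feature 1 and $g$ is the sigmoid applied element-wise: $$ h(X, \theta) = g(X\theta) = \frac{1}{1 + e^{-X\theta}} $$
$$ J(\theta) = -\frac{1}{m}\left(y^{T}\log(h) + (1-y)^{T}\log(1-h)\right) $$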
So by minimizing the cost function, we can find the optimal parameters $\theta$ that maximize the log-likelihood of the data. This is done using optimization algorithms like gradient descent or its variants.
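The vectorized cost can be sketched with NumPy as below. The function names and the toy matrix `X` (a bias column plus two made-up features) are assumptions for illustration, and no clipping is applied, so predictions of exactly 0 or 1 would produce infinities.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def cost(X, y, theta):
    """J(theta) = -(1/m) * (y^T log(h) + (1 - y)^T log(1 - h))."""
    m = X.shape[0]
    h = sigmoid(X @ theta)                       # shape (m, 1)
    J = -(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m
    return J.item()                              # 1x1 array to Python float


# Three toy examples with a bias column and two made-up features.
X = np.array([[1.0, 3.0, 1.0], [1.0, 0.0, 4.0], [1.0, 2.0, 2.0]])
y = np.array([[1.0], [0.0], [1.0]])
theta_zero = np.zeros((3, 1))
print(cost(X, y, theta_zero))   # log(2), about 0.693
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2$ regardless of the labels, which is a handy sanity check before training.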
Gradient Descent
Gradient descent is an optimization algorithm that minimizes the cost function by iteratively updating the model parameters. We repeat the following update until convergence: $$ \theta = \theta - \alpha \nabla J(\theta) $$ where $\alpha$ is the learning rate, which controls the step size of the update, and $\nabla J(\theta)$ is the gradient of the cost function with respect to the parameters $\theta$. When $\theta$ is a vector, the gradient $\nabla J(\theta)$ is also a vector: the vector of partial derivatives of the cost function $J(\theta)$ with respect to each parameter.
$$ \nabla J(\theta) = \left( \begin{array}{c} \frac{\partial J}{\partial \theta_1} \newline \frac{\partial J}{\partial \theta_2} \newline \vdots \newline \frac{\partial J}{\partial \theta_n} \end{array} \right) $$
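The update rule can be tried on a toy problem before applying it to logistic regression. The quadratic cost $J(\theta) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$, the learning rate, and the iteration count below are assumptions picked so that the minimum at $(3, -1)$ is known in advance.

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


def gradient_descent(theta, alpha, num_iters):
    for _ in range(num_iters):
        grad = gradient(theta)
        # theta = theta - alpha * grad J(theta), applied to every component
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


theta_final = gradient_descent([0.0, 0.0], alpha=0.1, num_iters=200)
print(theta_final)   # close to [3.0, -1.0]
```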
Different models have different cost functions, so the way to calculate the gradient differs from model to model. The chain rule simplifies this calculation. It states that if we have a composite function $f(g(x))$, then the derivative of $f$ with respect to $x$ is given by:
$$ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} $$
This is also the key rule for backpropagation in neural networks, where we compute the gradient of the loss function with respect to the model parameters by applying the chain rule iteratively through the layers of the network. Frameworks like PyTorch or TensorFlow compute these gradients automatically using the chain rule, so we don’t have to do it manually.
In our case, we can calculate the gradient of the cost function with respect to the parameters $\theta$ as follows:
We first need the derivative of the sigmoid function:
$$ \begin{align*} h'(x)&=\left(\frac{1}{1+e^{-x}}\right)'=\frac{-(1+e^{-x})'}{(1+e^{-x})^2}=\frac{-1'-(e^{-x})'}{(1+e^{-x})^2}=\frac{0-(-x)'(e^{-x})}{(1+e^{-x})^2}=\frac{-(-1)(e^{-x})}{(1+e^{-x})^2}=\frac{e^{-x}}{(1+e^{-x})^2} \newline &=\left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right)=h(x)\left(\frac{+1-1 + e^{-x}}{1+e^{-x}}\right)=h(x)\left(\frac{1 + e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right)=h(x)(1 - h(x)) \end{align*} $$
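The closed form $h'(x) = h(x)(1 - h(x))$ can be checked against a finite-difference approximation. The helper `numerical_derivative` and its step size are assumptions for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Closed form h'(x) = h(x) * (1 - h(x))."""
    h = sigmoid(x)
    return h * (1.0 - h)


def numerical_derivative(f, x, step=1e-6):
    """Central finite difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2.0 * step)


for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_derivative(x), numerical_derivative(sigmoid, x))
```

The two columns should agree to many decimal places, with the maximum slope of 0.25 at $x = 0$.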
Gradient
The partial derivative of the cost with respect to $\theta_j$ can be calculated as follows:
$$ \begin{align*} \frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left [ y^{(i)} \log ( h(x^{(i)}, \theta) ) + (1-y^{(i)}) \log (1 - h(x^{(i)}, \theta)) \right ] \newline &= - \frac{1}{m}\sum_{i=1}^m \left [y^{(i)} \frac{\partial}{\partial \theta_j} \log ( h(x^{(i)}, \theta)) + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} \log (1 - h(x^{(i)}, \theta))\right ] \newline &= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h(x^{(i)}, \theta)}{ h(x^{(i)}, \theta)} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h(x^{(i)}, \theta))}{1 - h(x^{(i)}, \theta)}\right ] \end{align*} $$
Then we apply the chain rule to the derivative of the sigmoid function:
$$ -\frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} h(x^{(i)}, \theta) (1 - h(x^{(i)}, \theta)) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{ h(x^{(i)}, \theta)} + \frac{- (1-y^{(i)}) h(x^{(i)}, \theta)(1 - h(x^{(i)}, \theta)) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h(x^{(i)}, \theta)}\right ] $$
It can be simplified to: $$ -\frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h(x^{(i)}, \theta)) - (1-y^{(i)}) h(x^{(i)}, \theta) \right ] x^{(i)}_j $$
$$ -\frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - y^{(i)} h(x^{(i)}, \theta) - h(x^{(i)}, \theta) + y^{(i)} h(x^{(i)}, \theta) \right ] x^{(i)}_j $$
$$ -\frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h(x^{(i)}, \theta) \right ] x^{(i)}_j $$
$$ \frac{1}{m}\sum_{i=1}^m \left [ h(x^{(i)}, \theta) - y^{(i)} \right ] x^{(i)}_j $$
In the end we have, in vectorized form: $$ \nabla J(\theta) = \frac{1}{m} X^T (h(X, \theta) - y) $$
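Putting the gradient and the update rule together gives a short training loop. This is a sketch under stated assumptions: the function name `gradient_descent`, the toy feature matrix in the [bias, positive count, negative count] layout, the learning rate, and the number of iterations are all made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for logistic regression.

    X: (m, n) feature matrix, y: (m, 1) labels, theta: (n, 1) parameters.
    """
    m = X.shape[0]
    for _ in range(num_iters):
        h = sigmoid(X @ theta)              # predictions, shape (m, 1)
        grad = (X.T @ (h - y)) / m          # (1/m) X^T (h - y)
        theta = theta - alpha * grad        # update step
    h = sigmoid(X @ theta)
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return float(J), theta


# Toy data: [bias, positive count, negative count] for four made-up tweets.
X = np.array([[1.0, 3.0, 0.0], [1.0, 2.0, 1.0], [1.0, 0.0, 3.0], [1.0, 1.0, 2.0]])
y = np.array([[1.0], [1.0], [0.0], [0.0]])
J, theta = gradient_descent(X, y, np.zeros((3, 1)), alpha=0.1, num_iters=500)
predictions = (sigmoid(X @ theta) > 0.5).astype(int)
print(J, predictions.ravel())
```

On this small separable dataset the final cost falls below the starting value of $\log 2$ and all four examples are classified correctly. Real tweet counts are much larger, so a smaller learning rate or feature scaling would likely be needed.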