I joined Twitter around six years ago, although if you were to look at my profile you would see that I rarely tweet or interact much with other users. I only follow around 150 accounts, most of which relate to my professional interests. However, I do follow some of my friends from school and university. I may not be an active tweeter, but I am a keen reader, as I find myself scrolling through hundreds of tweets daily. Over the months I’ve started to think that my friends only tweet about things that annoy them or cause them some dissatisfaction, so I decided to put that to the test.
In this blog post I will be checking the sentiment of my friends’ tweets from the last few years. Using freely available data from Kaggle, I trained a Random Forest classifier (with the help of some Natural Language Processing) to classify tweets as ‘Positive’ or ‘Negative’. I then used the Twitter API to pull my friends’ tweets and analysed everything in R.
Step 1: Get data to train a model and clean it
My first job was to get data, and a lot of it. Fortunately Kaggle had just what I was looking for and provided me with 1.6 million tweets classified as positive or negative. Great. So I imported this into R (which, surprisingly, didn’t take all that long) and carried out some data preprocessing. This included:
subsetting the data (my laptop cannot handle a sparse matrix of 1.6 million tweets…)
removing unwanted columns
coercing the ‘Sentiment’ to be a factor
renaming columns
fixing the date (which required a couple of custom functions to wrangle it into a date object)
I had to load a number of packages to help with the analysis:
The twitteR package allowed me to pull my friends’ timelines of tweets using Twitter’s API
tm and SnowballC contain a number of useful functions relating to the Natural Language Processing part of the project
caTools allows for an easy way of splitting test and training data
The randomForest package was used to train a model on the tweets to learn the sentiment
dplyr, stringr and lubridate helped with data wrangling
To visualise the results I used ggplot2, cowplot and ggthemes
# Load packages
library(twitteR)
library(tm)
library(SnowballC)
library(caTools)
library(randomForest)
library(stringr)
library(lubridate)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(cowplot)
# Get Tweets to train model from Kaggle -----------------------------------
# Read in data
kaggle_tweets <- read.csv("C:/Users/Charliespackman/Documents/Data Science/Blog/New posts/Are my friends negative tweeters/kaggle tweets.csv",
header = FALSE,
stringsAsFactors = FALSE)
# Subset data
kaggle_tweets <- sample_n(kaggle_tweets, 10000)
# Remove unwanted columns
kaggle_tweets <- kaggle_tweets[c(-2, -4)]
# Rename columns
colnames(kaggle_tweets) <- c("Sentiment",
"Date",
"User",
"Tweet")
# Make Sentiment a factor
kaggle_tweets$Sentiment <- factor(kaggle_tweets[, 1],
levels = c("0", "4"),
labels = c("Negative", "Positive"))
# Custom function to transform month abbreviations to numerical values
fix_month <- function(x){
match(tolower(x), tolower(month.abb))}
# Custom function to transform date
fix_date <- function(x){
new_date <- as_date(paste(substr(x, 25, 29),
paste0("0", fix_month(substr(x, 5, 7))),
substr(x, 9, 10)), tz = NULL)
return(new_date)}
# Transform the dates in the data set
kaggle_tweets$Date <- fix_date(kaggle_tweets$Date)
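As a quick check of the two helper functions (the example timestamp below is my own, assuming the raw Kaggle dates follow the ‘Mon Apr 06 22:19:45 PDT 2009’ layout that the substr() offsets imply):
# Sanity check of the date helpers on a made-up example timestamp
fix_month("Apr")                         # 4
fix_date("Mon Apr 06 22:19:45 PDT 2009") # should give the Date 2009-04-06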
Opening a bag of worms
Next I had to transform the data so a model could learn from it. For this I used Natural Language Processing, with the tm and SnowballC packages. I started by creating a corpus, which is simply a collection of text documents, one per tweet. This later gets turned into a huge matrix in which every column represents a word and every row represents a tweet, and each cell records the frequency of that word within the tweet. For example, the first tweet in the dataset was “I was watching Bones last night, Ep ‘The Headless Witch in the Woods’, being me, I got scared”. The row for this tweet would contain a 1 in each column corresponding to one of its words, since every word appears exactly once.
corpus = VCorpus(VectorSource(kaggle_tweets$Tweet)) # Creating the corpus
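To make that matrix representation concrete, here is a small toy example (the two ‘tweets’ below are made up for illustration and are not part of the analysis):
# Toy example: two made-up tweets turned into a document-term matrix
toy_corpus <- VCorpus(VectorSource(c("I got scared watching Bones last night",
                                     "Bones was great last night")))
toy_dtm <- DocumentTermMatrix(toy_corpus)
inspect(toy_dtm) # rows = tweets, columns = words, cells = word counts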
Once the corpus was created I had to clean the tweets: converting everything to lowercase, removing numbers, punctuation, stop words and extra white space, and stemming the words.
corpus = tm_map(corpus, content_transformer(tolower)) # all letters to lowercase
corpus = tm_map(corpus, removeNumbers) # removing numbers
corpus = tm_map(corpus, removePunctuation) # removing punctuation
corpus = tm_map(corpus, removeWords, stopwords()) # removing stopwords
corpus = tm_map(corpus, stemDocument) # stemming words
corpus = tm_map(corpus, stripWhitespace) # removing whitespace
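To see what all that cleaning actually does, you can peek at the first tweet after the transformations (my own check, not something that appeared in the original post):
# Inspect the first cleaned tweet
as.character(corpus[[1]]) # the 'Bones' tweet, now lowercased, stemmed and stripped of stop words and punctuation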
I then transformed the corpus into a more workable object called a ‘Document Term Matrix’ and reduced its sparsity by removing very infrequent words. This makes the matrix smaller and eases the load on the model while learning. Finally I added the Sentiment of each tweet to the data frame so that the model has labels to learn from.
dtm <- DocumentTermMatrix(corpus) # Transforming corpus to a workable matrix
dtm_sparse <- removeSparseTerms(dtm, 0.999999) # reducing sparsity
dataset <- as.data.frame(as.matrix(dtm_sparse)) # transforming into a data frame
dataset$Sentiment <- kaggle_tweets$Sentiment # creating a new dataset for the model
dataset <- dataset %>%
select(Sentiment, everything())
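As a quick sanity check (my addition, not in the original post) you can compare dimensions before and after the pruning to see how many terms removeSparseTerms actually dropped:
dim(dtm)        # documents x terms before pruning
dim(dtm_sparse) # documents x terms after pruning
dim(dataset)    # same rows, plus one extra column for Sentiment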
Training the model
The first step in training the model was to split the dataset into test and training sets. The caTools package helps with this.
# Split into test and train set
split_vector = sample.split(dataset$Sentiment, SplitRatio = 0.8)
training_set = subset(dataset, split_vector == TRUE)
test_set = subset(dataset, split_vector == FALSE)
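sample.split stratifies on the label, so both sets should keep roughly the same positive/negative balance; a quick check (my addition) confirms this:
# Check that the class balance is preserved by the split
prop.table(table(training_set$Sentiment)) # class balance in the training set
prop.table(table(test_set$Sentiment))     # should be roughly the same here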
After splitting the data I fed the training set to a Random Forest algorithm, which the randomForest package makes easy. I opted for 5 trees to reduce the load on my laptop. Once the model was trained (which took about 20 minutes…), I tested it on the test set and computed a confusion matrix to assess the accuracy.
# Training the random forest classifier
classifier <- randomForest(x = training_set[-1],
y = training_set$Sentiment,
ntree = 5)
# Test classifier on test set
test_prediction <- predict(classifier, newdata = test_set[-1])
cm <- table(test_set[, 1], test_prediction)
## test_prediction
## Negative Positive
## Negative 690 322
## Positive 288 700
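From the confusion matrix above the overall accuracy works out at roughly 69.5%, i.e. (690 + 700) / 2000. In code:
# Overall accuracy from the confusion matrix
sum(diag(cm)) / sum(cm) # ~0.695 for the counts shown above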
The model didn’t perform too badly considering the small number of tweets used. Given the chance I would have loved to train a more robust model with more trees and more data, but I had to work within the restrictions of my laptop’s computing power. Perhaps in a later blog post I will carry out something similar using cloud computing on Azure or AWS.
Now that the model was tested and I was happy with the outcome, my next step was to retrieve my friends’ tweets and test my hypothesis.
Click here for part 2.