Mining Enron Emails
You might have heard about the Enron scandal that came to light in 2001 and eventually led to the bankruptcy of the Enron Corporation. At the time, it was one of the largest corporate frauds ever uncovered. Enron's top executives used what is called mark-to-market accounting to dress up their financial statements. They used this accounting shenanigan to trick investors and make money, until the situation spiraled out of control and finally led to the bankruptcy of the firm. You might like to watch the documentary Enron: The Smartest Guys in the Room before moving ahead with this article.
In this article, I will show how to use different text mining techniques to find out the sentiments of Enron executives as the scandal was taking shape. After the scandal, emails exchanged between Enron executives were made public. I will use the email data available in the UCI Machine Learning Repository. This dataset is already provided as a bag of words, so we don't have to build the corpus from raw emails. Let's get started and load the dataset.
# loading libraries
library(dplyr)
library(ggplot2)
library(tidyr)
The Enron email collection is provided in two files. The first, docword.enron.txt, is the bag of words in sparse format with three columns: docID, wordID and count. The second, vocab.enron.txt, is the vocabulary file, listing one word per line; a word's line number serves as its wordID. The whole dataset contains 39,861 documents (emails) identified by docID, 28,102 words in the vocabulary and roughly 6,400,000 words in total. We need to merge the two files by wordID before we can carry out the analysis.
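To make the sparse format concrete, the layout of docword.enron.txt looks like this (schematically; the three header lines are why the code below skips them):

D                   # number of documents (39861 here)
W                   # size of the vocabulary (28102 here)
NNZ                 # number of (docID, wordID, count) rows that follow
docID wordID count
docID wordID count
...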
# reading enron email dataset
enron_vocab <- read.csv("vocab.enron.txt", header = FALSE,
                        col.names = c("word"),
                        stringsAsFactors = FALSE) %>%
  mutate(wordID = row_number())  # a word's line number is its wordID

# skip = 3 drops the three header lines (D, W, NNZ)
enron_words <- read.csv("docword.enron.txt", header = FALSE, sep = " ",
                        col.names = c("docID", "wordID", "count"),
                        skip = 3)

enron_words <- merge(enron_words, enron_vocab, by = "wordID") %>%
  select(docID, word, count)
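Before moving on, it is worth a quick sanity check that the merge produced one row per (document, word) pair:

# quick sanity check on the merged data
head(enron_words)   # docID, word, count
nrow(enron_words)   # one row per (document, word) pair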
Now that the dataset is loaded and ready for analysis, we will use the tidytext package to extract sentiment from the word corpus. We will work with three different lexicons that tidytext provides: afinn, nrc and bing.
But what is a lexicon? A lexicon is a vocabulary of words annotated for a particular purpose, here sentiment. For example, the AFINN lexicon assigns each of its 2476 pre-defined words a score from -5 to 5, where a negative score means negative sentiment and vice versa. The nrc lexicon associates its 13901 predefined words with ten different sentiments, viz. positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust (a word can carry more than one). The bing lexicon categorizes its 6788 predefined words as either positive or negative. Let's load these three lexicons into the workspace.
# sentiment scores
library(tidytext)
afinn <- get_sentiments("afinn")
nrc   <- get_sentiments("nrc")
bing  <- get_sentiments("bing")
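A quick peek confirms the shape of each lexicon (note: in newer releases of tidytext the afinn score column is named value rather than score; this article assumes the older naming throughout):

head(afinn)  # columns: word, score
head(nrc)    # columns: word, sentiment
head(bing)   # columns: word, sentiment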
Now we will score and analyze the dataset with each of these lexicons.
# enron sentiment based on AFINN lexicon
enron_afinn <- enron_words %>%
  # inner join to AFINN lexicon
  inner_join(afinn, by = "word") %>%
  # count word occurrences by score and document ID
  count(score, docID, wt = count)

enron_afinn_agg <- enron_afinn %>%
  # group by document
  group_by(docID) %>%
  # total score per document, weighted by occurrences
  summarize(total_score = sum(score * n))

ggplot(enron_afinn_agg, aes(docID, total_score)) +
  geom_smooth()
Now let's check the sentiment of the Enron emails with the nrc lexicon
# enron sentiment based on nrc lexicon
enron_nrc <- inner_join(enron_words, nrc, by = "word")

# data frame of counts
enron_nrc <- enron_nrc %>%
  # group by sentiment
  group_by(sentiment) %>%
  # total count by sentiment
  summarize(total_count = sum(count))

# plotting the sentiment counts
ggplot(enron_nrc, aes(x = sentiment, y = total_count)) +
  geom_col()
From the above sentiment bars, it is evident that the executives had a very positive sentiment among themselves, followed by trust and anticipation, which is quite natural given that those executives were part of a common clique.
Let's now check the sentiments with the bing lexicon
# enron sentiment by bing lexicon
enron_bing <- enron_words %>%
  # inner join to the lexicon
  inner_join(bing, by = "word") %>%
  # count word occurrences by sentiment and document ID
  count(sentiment, docID, wt = count) %>%
  # spreading the sentiments into columns
  spread(sentiment, n, fill = 0) %>%
  # adding polarity field
  mutate(polarity = positive - negative)

# plotting the sentiment
ggplot(enron_bing, aes(docID, polarity)) +
  geom_smooth() +
  geom_hline(yintercept = 0, color = "red") +
  ggtitle("Enron Emails Chronological Polarity")
As you can see, for most of the email documents the sentiment among the executives is positive, but a few of the emails with docID near 30000 carry some negative sentiment as well. This dip is worth analyzing further to find out the exact reason behind the negative sentiment. You might try this on your own.
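As a possible starting point (a sketch reusing the enron_bing data frame built above), you could list the documents with the most negative polarity:

# documents with the most negative polarity
enron_bing %>%
  filter(polarity < 0) %>%
  arrange(polarity) %>%
  head(10)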
Let’s now move ahead with some other sentiment calculations
Word Frequency Analysis
# enron frequency analysis
enron_sents <- inner_join(enron_words, bing, by = "word")

# tidy sentiment calculation
enron_tidy_sentiment <- enron_sents %>%
  count(word, sentiment, wt = count) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(polarity = positive - negative)

# subsetting the data for words with high polarity
enron_tidy_small <- enron_tidy_sentiment %>%
  filter(abs(polarity) >= 1000)

# adding polarity direction
enron_tidy_pol <- enron_tidy_small %>%
  mutate(pol = ifelse(polarity > 0, "positive", "negative"))

# plotting the word frequency
ggplot(enron_tidy_pol, aes(reorder(word, polarity), polarity, fill = pol)) +
  geom_col() +
  ggtitle("Enron Emails: Sentiment Word Frequency") +
  theme(axis.text.x = element_text(angle = 90, vjust = -0.1))
Now let's create a comparison cloud based on the different emotions of the nrc lexicon. We will drop the positive and negative categories from the corpus to focus on the other eight emotions available in the nrc lexicon.
# enron emotional introspection
enron_sentiment <- inner_join(enron_words, nrc, by = "word")

# dropping positive and negative
enron_pos_neg <- enron_sentiment %>%
  filter(!grepl("positive|negative", sentiment))

# counting term occurrences by sentiment, then spreading
enron_tidy <- enron_pos_neg %>%
  count(sentiment, term = word, wt = count) %>%
  spread(sentiment, n, fill = 0) %>%
  as.data.frame()

# setting row names to the terms
rownames(enron_tidy) <- enron_tidy[, 1]
# dropping the terms column
enron_tidy[, 1] <- NULL

# comparison cloud
library(wordcloud)
comparison.cloud(enron_tidy, max.words = 200, title.size = 1.5)
Now, finally, let's conclude this article by creating a radar chart of the above eight emotions of the nrc lexicon.
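Here is a minimal sketch of such a radar chart, assuming the fmsb package (other radar chart packages would work too). It reuses enron_words and the nrc lexicon, dropping the positive and negative categories:

# enron emotion radar chart (a sketch using the fmsb package)
library(fmsb)

# total occurrences of each of the eight emotions
enron_emotions <- enron_words %>%
  inner_join(nrc, by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  group_by(sentiment) %>%
  summarize(total_count = sum(count))

# fmsb::radarchart() expects one column per axis, with the first two
# rows holding each axis' maximum and minimum
radar_data <- as.data.frame(rbind(
  rep(max(enron_emotions$total_count), 8),  # axis maxima
  rep(0, 8),                                # axis minima
  enron_emotions$total_count                # actual emotion counts
))
colnames(radar_data) <- enron_emotions$sentiment

radarchart(radar_data, title = "Enron Emails: NRC Emotion Radar Chart")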
I hope you liked this article. If you also want to share any insights about Enron, let us know in the comments below. Also, keep visiting the site for more such articles. Till then, happy learning 🙂