
Text Analytics: Mining Enron Emails

You might have heard about the Enron scandal, which came to light in 2001 and eventually led to the bankruptcy of the Enron Corporation. At the time, it was one of the largest corporate frauds in American history. Enron's top executives used what is called mark-to-market accounting to dress up the company's financial statements. They used this accounting shenanigan to mislead investors and enrich themselves, until the scheme spiraled out of control and finally drove the firm into bankruptcy. You might like to watch the documentary Enron: The Smartest Guys in the Room before moving ahead with this article.

In this article, I will show how to use different text mining techniques to explore the sentiments of Enron executives while the scandal was taking shape. After the scandal broke, emails exchanged between Enron executives were made public. I will use the email data available in the UCI Machine Learning Repository. The dataset already comes as a bag of words, so we don't have to build a corpus from raw emails. Let's get started by loading the libraries we will need.

# loading libraries
library(dplyr)
library(ggplot2)
library(tidyr)

The Enron email collection is provided in two files. docword.enron.txt is the bag of words in sparse format, with three columns: docID, wordID and count. vocab.enron.txt is the vocabulary file, with one word per line; a word's ID is simply its line number. The whole collection contains 39,861 documents (emails) identified by docID, a vocabulary of 28,102 words and roughly 6,400,000 words in total. We need to merge these two files by wordID before we can carry out the analysis.
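
Before we parse them, it is worth peeking at the raw files to see the sparse format for yourself. A minimal sanity check like the following (assuming both files sit in your working directory) is enough:

# a quick look at the raw files
# docword.enron.txt: the first three lines are a header giving the number of
# documents, the vocabulary size and the number of non-zero counts; every
# line after that is a "docID wordID count" triple
readLines("docword.enron.txt", n = 6)

# vocab.enron.txt: one word per line, the line number being the word's ID
readLines("vocab.enron.txt", n = 5)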

# reading the vocabulary: one word per line, so the ID is the row number
enron_vocab <- read.csv("vocab.enron.txt", header = FALSE, col.names = c("word"), 
 stringsAsFactors = FALSE) %>% 
 mutate(wordID = row_number())

# reading the sparse bag of words; the first three lines are a header, hence skip = 3
enron_words <- read.csv("docword.enron.txt", header = FALSE, sep = " ", 
 col.names = c("docID", "wordID", "count"), skip = 3)

# attaching the actual words to their IDs
enron_words <- merge(enron_words, enron_vocab, by = "wordID") %>% 
 select(docID, word, count)

Now that the dataset is loaded and ready for analysis, we will use the tidytext package to extract sentiment from the word corpus. tidytext provides access to several sentiment lexicons; we will use three of them: afinn, nrc and bing.

But what is a lexicon? A lexicon is a curated vocabulary annotated for a particular purpose. For example, the AFINN lexicon assigns each of its 2476 predefined words a score from -5 to 5, where a negative score means negative sentiment and vice versa. The nrc lexicon assigns its 13901 predefined words to ten different sentiments, namely positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust (a word can carry more than one of them). The bing lexicon categorizes 6788 predefined words as either positive or negative. Let's load these three lexicons into the workspace.

# sentiment scores
library(tidytext)
afinn <- get_sentiments("afinn")
nrc <- get_sentiments("nrc")
bing <- get_sentiments("bing")
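
A quick peek at the first few rows shows how differently these lexicons are structured. One caveat worth knowing: in recent versions of tidytext, the afinn and nrc lexicons are downloaded on first use via the textdata package, and the AFINN score column is named value rather than score.

# inspecting the structure of each lexicon
head(afinn) # word plus a numeric value from -5 to 5
head(nrc)   # word plus one of ten sentiment labels (a word may appear under several)
head(bing)  # word plus a positive/negative label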

Now we will calculate and analyze sentiment in the dataset with each of these lexicons.

AFINN lexicon

# enron sentiment based on the AFINN lexicon
enron_afinn <- enron_words %>% 
 # inner join to the AFINN lexicon
 # (recent tidytext versions name the score column "value", not "score")
 inner_join(afinn, by = "word") %>%
 # count by document ID and score, weighted by each word's occurrence count
 count(docID, value, wt = count)

enron_afinn_agg <- enron_afinn %>% 
 # group by document
 group_by(docID) %>%
 # total score per document: score times number of occurrences
 summarize(total_score = sum(value * n))

ggplot(enron_afinn_agg, aes(docID, total_score)) +
 geom_smooth() +
 ggtitle("Enron Emails: AFINN Sentiment Trend")

Now let's check the sentiment of the Enron emails with the nrc lexicon.

nrc lexicon

# enron sentiment based on nrc lexicon
enron_nrc <- inner_join(enron_words, nrc, by = "word")

# DataFrame of counts
enron_nrc <- enron_nrc %>% 
 # group by sentiment
 group_by(sentiment) %>% 
 # total count by sentiment
 summarize(total_count = sum(count))

# Plotting the sentiment counts
ggplot(enron_nrc, aes(x = sentiment, y = total_count)) +
 geom_col()

From the sentiment bars above, it is evident that the prevailing sentiment in these emails is positive, followed by trust and anticipation. That is perhaps unsurprising, as the executives writing to one another were part of a common clique.
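
To put numbers behind those bars, we can compute the share each emotion contributes to the total matched word count; this is a small follow-up on the enron_nrc data frame we just built:

# share of each sentiment in the total matched word count
enron_nrc %>% 
 mutate(share = round(total_count / sum(total_count), 3)) %>% 
 arrange(desc(share))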

Let's now check the sentiments with the bing lexicon.

bing lexicon

# enron sentiment by the bing lexicon
enron_bing <- enron_words %>%
 # inner join to the lexicon
 inner_join(bing, by = "word") %>%
 # count by sentiment and document, weighted by each word's occurrence count
 count(sentiment, docID, wt = count) %>%
 # spreading the sentiments into columns
 spread(sentiment, n, fill = 0) %>%
 # adding a polarity field
 mutate(polarity = positive - negative)

# plotting the sentiment
ggplot(enron_bing, aes(docID, polarity)) + 
 geom_smooth() +
 geom_hline(yintercept = 0, color = "red") +
 ggtitle("Enron Emails Chronological Polarity")

As you can see, the sentiment is positive for most of the email documents, but a few emails with docID near 30000 carry negative sentiment as well. This dip is worth analyzing further to find the exact reason behind it. You might try that on your own; the sketch below gives a starting point.
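
Here is a minimal sketch to get you started. Note that the docID window of 25000 to 35000 is only an assumption read off the smoothed plot above; adjust it as you see fit.

# a starting point: the most negative emails around docID 30000
# the docID window below is an assumption based on the plot
negative_docs <- enron_bing %>% 
 filter(docID > 25000, docID < 35000, polarity < 0) %>% 
 arrange(polarity) %>% 
 head(10)

# the negative words driving those emails
enron_words %>% 
 filter(docID %in% negative_docs$docID) %>% 
 inner_join(bing, by = "word") %>% 
 filter(sentiment == "negative") %>% 
 count(word, wt = count, sort = TRUE)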

Let's now move on to some word-level sentiment calculations.

Word Frequency Analysis

# enron frequency analysis
enron_sents <- inner_join(enron_words, bing, by = "word")

# tidy sentiment calculation
enron_tidy_sentiment <- enron_sents %>% 
 count(word, sentiment, wt = count) %>%
 spread(sentiment, n, fill = 0) %>%
 mutate(polarity = positive - negative)

# subsetting the data for words with high polarity
enron_tidy_small <- enron_tidy_sentiment %>% 
 filter(abs(polarity) >= 1000)

# labeling each word by its polarity sign
enron_tidy_pol <- enron_tidy_small %>% 
 mutate(
 pol = ifelse(polarity > 0, "positive", "negative")
 )

# plotting the word frequency
ggplot(
 enron_tidy_pol, 
 aes(reorder(word, polarity), polarity, fill = pol)) +
 geom_col() + 
 xlab("word") +
 ggtitle("Enron Emails: Sentiment Word Frequency") + 
 theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Now let's create a comparison cloud based on the different emotions in the nrc lexicon. We will drop the positive and negative categories from the corpus to focus on the other eight emotions available in nrc.

Word Cloud

# enron emotional introspection
enron_sentiment <- inner_join(enron_words, nrc, by = "word")

# dropping the positive and negative categories
enron_pos_neg <- enron_sentiment %>%
 filter(!grepl("positive|negative", sentiment))

# counting term occurrences by sentiment, then spreading 
enron_tidy <- enron_pos_neg %>% 
 count(sentiment, term = word, wt = count) %>% 
 spread(sentiment, n, fill = 0) %>%
 as.data.frame()

# setting row names
rownames(enron_tidy) <- enron_tidy[, 1]

# dropping terms column
enron_tidy[, 1] <- NULL

# comparison cloud
library(wordcloud)
comparison.cloud(enron_tidy, max.words = 200, title.size = 1.5)

Finally, let's conclude this article by creating a radar chart of the above eight emotions from the nrc lexicon.

Radar Chart

# enron radar chart
enron_sentiment <- inner_join(enron_words, nrc, by = "word")

# dropping the positive and negative categories
enron_pos_neg <- enron_sentiment %>%
 filter(!grepl("positive|negative", sentiment))

# total occurrences per emotion
enron_tally <- enron_pos_neg %>%
 group_by(sentiment) %>%
 tally(wt = count)

# JavaScript radar chart
library(radarchart)
chartJSRadar(enron_tally)

I hope you liked this article. If you have any insights about Enron to share, let us know in the comments below. And keep visiting the site for more such articles. Till then, happy learning 🙂
