My obsession with data processing, analysis and visualization (especially using R) started in academia, and now it occupies both my work and hobby time. Come learn with me (or even teach me) as I figure out how to tackle tough data problems at work, or more likely than not try to use R to do some for-fun analyses! Matthew is a DZone MVB and is not an employee of DZone and has posted 10 posts at DZone. You can read more from them at their website.

Topic Modeling in Python and R: A Rather Nosy Analysis of the Enron Email Corpus

Having only ever played with Latent Dirichlet Allocation using gensim in Python, I was very interested to see a nice example of this kind of topic modelling in R.  Whenever I see a really cool analysis done, I get the urge to do it myself.  What better corpus to do topic modelling on than the Enron email dataset?!  Let me tell you, this thing is a monster!  According to the website I got it from, it contains about 500k messages from 151 mostly senior-management users, organized into one folder per user.  I didn’t want to include everything in my analysis, so I decided to look only at messages in the “sent” or “sent items” folders.

Being a big advocate of R, I really tried to do all of the processing and analysis in R, but it was too difficult and was taking more time than I wanted. So I dusted off my Python skills (thank you grad school!) and did the bulk of the data processing and preparation in Python, and the text mining in R.  Following is the code (hopefully commented well enough) that I used to process the corpus in Python:

from os import listdir, chdir
import re

docs = []

# Here's my attempt at coming up with regular expressions to filter out
# parts of the enron emails that I deem as useless.

email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
contentinfo_pat = re.compile("----------------------------------------.+----------------------------------------")
forwardedby_pat = re.compile("----------------------.+----------------------")
caution_pat = re.compile('''\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*.+\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*''')
privacy_pat = re.compile(" _______________________________________________________________.+ _______________________________________________________________")

# The enron emails are in 151 directories, one for each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each 
# email text file into one long nested list.
# I've used readlines() to read in the emails because read() 
# didn't seem to work with these email files.

names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name,d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])

# Here I go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list.  This is probably inefficient, but oh well - Python
# is pretty fast anyway!

docs_final = []
for subfolder in docs:
    for email in subfolder:
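        # Everything up to and including the first ".nsf" or ".pst" is treated as
        # header material and cut off.  If neither marker is present, keep the
        # whole text rather than reusing a stale (or undefined) etype.
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        else:
            etype = None
        email_new = email[email.find(etype)+4:] if etype else email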
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        if "-----Original Message-----" in email_new:
            email_new = email_new.replace("-----Original Message-----","")
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = contentinfo_pat.sub('', email_new)
        email_new = forwardedby_pat.sub('', email_new)
        email_new = caution_pat.sub('', email_new)
        email_new = privacy_pat.sub('', email_new)
        docs_final.append(email_new)

# Here I proceed to dump each and every email into about 126 thousand separate 
# txt files in a newly created 'data' directory.  This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.

for n, doc in enumerate(docs_final):
    with open("/home/inkhorn/enron/data/%s.txt" % n, 'w') as outfile:
        outfile.write(doc)

Having seen Python’s performance in rifling through these Enron emails, I was very impressed!  It was very agile in creating a directory with the largest number of files I’d ever seen on my computer!
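To get a feel for what the header-stripping patterns actually do, here is a minimal sketch that applies a few of them to a single message. The message text is made up for illustration (it is not from the corpus), and `strip_headers` is just a hypothetical helper name. Note that `email_pat` deletes any line containing an `@`, not only header lines.

```python
import re

# A subset of the patterns from the script above, written as raw strings.
to_pat = re.compile(r"To:.+\n")
subject_pat = re.compile(r"Subject:.+\n")
from_pat = re.compile(r"From:.+\n")
email_pat = re.compile(r".+@.+")

# Hypothetical message text, invented for illustration only.
sample = (
    "From: Alice Example\n"
    "To: Bob Example\n"
    "Subject: Gas prices\n"
    "Please review the attached contract.\n"
    "You can reach me at alice@example.com\n"
    "Thanks\n"
)

def strip_headers(text):
    """Apply each pattern in turn, deleting whatever it matches."""
    for pat in (to_pat, subject_pat, from_pat, email_pat):
        text = pat.sub("", text)
    return text

cleaned = strip_headers(sample)
print(cleaned)
```

Because `.` does not match a newline, `email_pat` removes the text of the line with the address but leaves its line break behind, which is why a blank line survives in the output.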

Okay, so now I had a directory filled with a whole lot of text files.  The next step was to bring them into R so that I could submit them to the LDA.  Following is the R code that I used:


# At this point, the python script should have been run, 
# creating about 126 thousand txt files.  I was very much afraid
# to import that many txt files into the tm package in R (my computer only
# runs on 8GB of RAM), so I decided to mark 60k of them for a sample, and move the
# rest of them into a separate directory

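library(tm)           # Corpus, DirSource, DocumentTermMatrix
library(topicmodels)  # LDA, terms, posterior
library(stringr)      # str_c

email_txts = list.files('data/')
email_txts_sample = sample(email_txts, 60000)
# Anchor and escape the pattern so only a trailing ".txt" extension is replaced
email_rename = data.frame(orig=email_txts_sample, new=sub("\\.txt$", ".rxr", email_txts_sample))
file.rename(str_c('data/',email_rename$orig), str_c('data/',email_rename$new))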

# At this point, all of the non-sampled emails (labelled .txt, not .rxr)
# need to go into a different directory. I created a directory that I called
# nonsampled/ and moved the files there via the terminal command "mv *.txt nonsampled/".
# It's very important that you don't try to do this via a file explorer, windows or linux,
# as the act of trying to display that many file icons is apparently very difficult for a regular machine :$

enron = Corpus(DirSource("/home/inkhorn/enron/data"))

extendedstopwords=c("a","about","above","across","after","MIME Version","forwarded","again","against","all","almost","alone","along","already","also","although","always","am","among","an","and","another","any","anybody","anyone","anything","anywhere","are","area","areas","aren't","around","as","ask","asked","asking","asks","at","away","b","back","backed","backing","backs","be","became","because","become","becomes","been","before","began","behind","being","beings","below","best","better","between","big","both","but","by","c","came","can","cannot","can't","case","cases","certain","certainly","clear","clearly","come","could","couldn't","d","did","didn't","differ","different","differently","do","does","doesn't","doing","done","don't","down","downed","downing","downs","during","e","each","early","either","end","ended","ending","ends","enough","even","evenly","ever","every","everybody","everyone","everything","everywhere","f","face","faces","fact","facts","far","felt","few","find","finds","first","for","four","from","full","fully","further","furthered","furthering","furthers","g","gave","general","generally","get","gets","give","given","gives","go","going","good","goods","got","great","greater","greatest","group","grouped","grouping","groups","h","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","her","here","here's","hers","herself","he's","high","higher","highest","him","himself","his","how","however","how's","i","i'd","if","i'll","i'm","important","in","interest","interested","interesting","interests","into","is","isn't","it","its","it's","itself","i've","j","just","k","keep","keeps","kind","knew","know","known","knows","l","large","largely","last","later","latest","least","less","let","lets","let's","like","likely","long","longer","longest","m","made","make","making","man","many","may","me","member","members","men","might","more","most","mostly","mr","mrs","much","must","mustn't","my","myself","n","necessary","need","needed","needing","needs"
,"never","new","newer","newest","next","no","nobody","non","noone","nor","not","nothing","now","nowhere","number","numbers","o","of","off","often","old","older","oldest","on","once","one","only","open","opened","opening","opens","or","order","ordered","ordering","orders","other","others","ought","our","ours","ourselves","out","over","own","p","part","parted","parting","parts","per","perhaps","place","places","point","pointed","pointing","points","possible","present","presented","presenting","presents","problem","problems","put","puts","q","quite","r","rather","really","right","room","rooms","s","said","same","saw","say","says","second","seconds","see","seem","seemed","seeming","seems","sees","several","shall","shan't","she","she'd","she'll","she's","should","shouldn't","show","showed","showing","shows","side","sides","since","small","smaller","smallest","so","some","somebody","someone","something","somewhere","state","states","still","such","sure","t","take","taken","than","that","that's","the","their","theirs","them","themselves","then","there","therefore","there's","these","they","they'd","they'll","they're","they've","thing","things","think","thinks","this","those","though","thought","thoughts","three","through","thus","to","today","together","too","took","toward","turn","turned","turning","turns","two","u","under","until","up","upon","us","use","used","uses","v","very","w","want","wanted","wanting","wants","was","wasn't","way","ways","we","we'd","well","we'll","wells","went","were","we're","weren't","we've","what","what's","when","when's","where","where's","whether","which","while","who","whole","whom","who's","whose","why","why's","will","with","within","without","won't","work","worked","working","works","would","wouldn't","x","y","year","years","yes","yet","you","you'd","you'll","young","younger","youngest","your","you're","yours","yourself","yourselves","you've","z")
dtm.control = list(
  tolower           = TRUE,
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  stopwords         = c(stopwords("english"), extendedstopwords),
  stemming          = TRUE,
  wordLengths       = c(3, Inf),
  weighting         = weightTf)

dtm = DocumentTermMatrix(enron, control=dtm.control)
dtm = removeSparseTerms(dtm,0.999)
dtm = dtm[rowSums(as.matrix(dtm))>0,]

k = 4

# Beware: this step takes a lot of patience!  My computer was chugging along for probably 10 or so minutes before it completed the LDA here.
lda.model = LDA(dtm, k)

# This enables you to examine the words that make up each topic that was calculated.  Bear in mind that I've chosen to stem all words possible in this corpus, so some of the words output will look a little weird.
terms(lda.model, 20)

# Here I construct a dataframe that scores each document according to how closely its content
# matches up with each topic.  The scores are posterior topic probabilities, so the closer a score
# is to 1, the more likely the document's content matches up with that topic.

emails.topics = posterior(lda.model, dtm)$topics
df.emails.topics = as.data.frame(emails.topics)
df.emails.topics = cbind(email=as.character(rownames(df.emails.topics)), 
                         df.emails.topics, stringsAsFactors=F)
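
In topicmodels, `posterior(...)$topics` holds one row per document with one probability per topic, and each row sums to 1. As a rough sketch of how you might read such a table, here is a small Python example that picks the most probable topic for each email. The file names and numbers are invented for illustration; they are not results from the corpus, and `dominant_topic` is a hypothetical helper.

```python
# Hypothetical document-topic probabilities (one row per email, k = 4 topics).
# The numbers are invented for illustration; they are not results from the corpus.
doc_topics = {
    "101.rxr": [0.70, 0.10, 0.10, 0.10],
    "102.rxr": [0.05, 0.15, 0.60, 0.20],
    "103.rxr": [0.25, 0.25, 0.25, 0.25],
}

def dominant_topic(probs):
    """Return (1-based topic number, probability) for the largest entry.

    On ties, the lowest-numbered topic wins.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best + 1, probs[best]

assignments = {name: dominant_topic(p) for name, p in doc_topics.items()}
for name, (topic, prob) in assignments.items():
    print(name, "-> Topic", topic, "with probability", prob)
```

A row like the third one, where every topic gets the same probability, is a reminder that the top topic is not always a meaningful label for a document.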

Phew, that took a lot of computing power! Now that it’s done, let’s look at the top 20 (stemmed) terms for each of the four topics:

      Topic 1   Topic 2     Topic 3      Topic 4     
 [1,] "time"    "thank"     "market"     "email"     
 [2,] "vinc"    "pleas"     "enron"      "pleas"     
 [3,] "week"    "deal"      "power"      "messag"    
 [4,] "thank"   "enron"     "compani"    "inform"    
 [5,] "look"    "attach"    "energi"     "receiv"    
 [6,] "day"     "chang"     "price"      "intend"    
 [7,] "dont"    "call"      "gas"        "copi"      
 [8,] "call"    "agreement" "busi"       "attach"    
 [9,] "meet"    "question"  "manag"      "recipi"    
[10,] "hope"    "fax"       "servic"     "enron"     
[11,] "talk"    "america"   "rate"       "confidenti"
[12,] "ill"     "meet"      "trade"      "file"      
[13,] "tri"     "mark"      "provid"     "agreement" 
[14,] "night"   "kay"       "issu"       "thank"     
[15,] "friday"  "corp"      "custom"     "contain"   
[16,] "peopl"   "trade"     "california" "address"   
[17,] "bit"     "ena"       "oper"       "contact"   
[18,] "guy"     "north"     "cost"       "review"    
[19,] "love"    "discuss"   "electr"     "parti"     
[20,] "houston" "regard"    "report"     "contract"
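
The terms in this table are what is left after the preprocessing options passed to `DocumentTermMatrix`. As a rough illustration of that kind of filtering, here is a minimal Python sketch that lowercases text, drops punctuation and digits, removes stopwords, and keeps only words of at least three letters. It is an assumption-laden stand-in, not tm's implementation: the stopword list is a tiny made-up subset, stemming is left out, non-ASCII letters are simply discarded, and `term_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# A tiny stand-in stopword list; the article uses a much longer one.
STOPWORDS = {"the", "a", "and", "to", "of", "for", "please", "is"}

def term_counts(text, min_length=3):
    """Lowercase, drop punctuation and digits, then filter and count terms."""
    text = text.lower()
    # ASCII-only simplification: anything that is not a-z or whitespace is removed.
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = text.split()
    kept = [t for t in tokens if t not in STOPWORDS and len(t) >= min_length]
    return Counter(kept)

counts = term_counts("Please review the gas contract; the gas price is up 12%.")
print(counts)
```

Each email reduced this way becomes one row of term counts, which is the shape of input the LDA step works from.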
Published at DZone with permission of Matthew Dubins, author and DZone MVB. (source)