Text Analysis with R

1 Introduction

Text analysis is akin to one of those Choose Your Own Adventure stories where there are many paths to the end of the story. With R, there are many packages you could use when analyzing text: tm, tidytext, tidyr, dplyr, ggplot2, quanteda, wordcloud, and more. In this report, I will cover two paths, which I will creatively refer to as path 1 and path 2.

2 Using the tm package - path 1

For the first part of the analysis, we will take the following route: create a corpus, normalize the data, create a document term matrix, create summary text, and visualize the results.


As a point of order, the code for this work is in this github repository. It will not include all the repetitive output you will see here, which is shown for the report's sake.

3 Creating a Corpus

I like to think of the corpus as a folder of documents. It is the container that holds the text, which lives in the individual documents. The corpus also carries metadata, and it is a representation of the text that makes the conversion to a Term Document Matrix or Document Term Matrix seamless. We will need to start by loading the tm package.

# load tm library
library(tm)


What I like about the corpus function is that it allows you to pull all the documents in at the same time. This command creates a volatile corpus (tm also offers a simple corpus and a permanent corpus). The three documents in this corpus are listed below.

# create a corpus
docs <- VCorpus(DirSource("~/TA-data-science-club/docs"))
summary(docs)
##                          Length Class             Mode
## TA_MHSPHP_PRESENTERS.txt 2      PlainTextDocument list
## TA_MHSPHP_USERS.txt      2      PlainTextDocument list
## tickets.txt              2      PlainTextDocument list


These documents represent text data from a chat box for a presentation on one of our information systems. The 'users' comments came from users who were present during a demonstration, and the 'presenter' comments came from a private chat box used by the presenters during the same demonstration. The other document contains the descriptive comments from several years of trouble tickets for our information systems. I picked these because they were easily available and because I needed to do a text analysis of the trouble tickets anyway. You would expect the two chat box documents to be very similar and the trouble tickets to be significantly different.

Now let’s take a look at the data that was uploaded.

# demonstrate lines in each document
doc1 <- readLines("/Users/davidcarnahan/TA-data-science-club/docs/TA_MHSPHP_USERS.txt")
doc1[128:138]
##  [1] "Hermione Granger: No problem Rose :)"                                                                                                                                               
##  [2] "Fenrir Greyback: Thanbks - Albus, yes, that was my thought. I've heard opf a project led by the Federal Health Archtecture agency that may provide a mechanism for the satate data."
##  [3] "Albus Dumbledore: Unfortunately it doesnt.  That product looks up one patient at a time and is stand alone.  The DHA Pharmacy folks are looking at accessing that globally."        
##  [4] "Dedalus Diggle: when was the MyLayouts functionality added?"                                                                                                                        
##  [5] "Albus Dumbledore: CDR Scott--will commo off line via Outlook."                                                                                                                      
##  [6] "Luna Lovegood: belay my last irt protocol - understood - it's coming"                                                                                                               
##  [7] "Albus Dumbledore: :)"                                                                                                                                                               
##  [8] "Ron Weasley: My layouts was added about 2-3 weeks ago"                                                                                                                              
##  [9] "Hermione Granger: thanks Ron!"                                                                                                                                                      
## [10] "Dedalus Diggle: @Ron --- very cool!"                                                                                                                                                
## [11] "Albus Dumbledore: Thank you too Hermione!"


Based on this text, you can see that I modified the names of the participants in order to protect the innocent. The important thing to note here is that you never know what you are going to get with text. So, you will want to be careful if you are doing text analysis on healthcare data, given the HIPAA challenges with names and other personally identifiable information.
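If you do need to scrub names yourself, here is a minimal sketch of the idea, assuming you already have a vector of identifiers to remove (the names below are placeholders, not from the real data):

# hypothetical example: redact known names before any analysis
names_to_redact <- c("Jane Doe", "John Smith")  # placeholder identifiers
redact <- function(x, names) {
  for (nm in names) x <- gsub(nm, "[REDACTED]", x, fixed = TRUE)
  x
}
doc1_clean <- redact(doc1, names_to_redact)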

4 Normalize the Data

This is where we will use the tm_map function to remove numbers, whitespace, punctuation, stop words, and more. We will begin by removing numbers.

# remove numbers
docs <- tm_map(docs, removeNumbers)

# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
##  [1] "Hermione Granger: No problem Rose :)"                                                                                                                                               
##  [2] "Fenrir Greyback: Thanbks - Albus, yes, that was my thought. I've heard opf a project led by the Federal Health Archtecture agency that may provide a mechanism for the satate data."
##  [3] "Albus Dumbledore: Unfortunately it doesnt.  That product looks up one patient at a time and is stand alone.  The DHA Pharmacy folks are looking at accessing that globally."        
##  [4] "Dedalus Diggle: when was the MyLayouts functionality added?"                                                                                                                        
##  [5] "Albus Dumbledore: CDR Scott--will commo off line via Outlook."                                                                                                                      
##  [6] "Luna Lovegood: belay my last irt protocol - understood - it's coming"                                                                                                               
##  [7] "Albus Dumbledore: :)"                                                                                                                                                               
##  [8] "Ron Weasley: My layouts was added about - weeks ago"                                                                                                                                
##  [9] "Hermione Granger: thanks Ron!"                                                                                                                                                      
## [10] "Dedalus Diggle: @Ron --- very cool!"                                                                                                                                                
## [11] "Albus Dumbledore: Thank you too Hermione!"


You shouldn't see much difference between this output and the one above because there were only a few numbers in the text. In item [8], '2-3 weeks' was reduced to '- weeks'. Nevertheless, it is important to remove numbers in case you do have them in the text. Now let's replace dashes and colons with spaces so we can then remove punctuation. The colon sits snug against each name, so if we removed punctuation directly the name and the following word would be fused together; adding the space first prevents that.

# create toSpace function
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})

# add space around dashes and colons
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")

# remove punctuation & special character
docs <- tm_map(docs, removePunctuation)

# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
##  [1] "Hermione Granger  No problem Rose  "                                                                                                                                           
##  [2] "Fenrir Greyback  Thanbks   Albus yes that was my thought Ive heard opf a project led by the Federal Health Archtecture agency that may provide a mechanism for the satate data"
##  [3] "Albus Dumbledore  Unfortunately it doesnt  That product looks up one patient at a time and is stand alone  The DHA Pharmacy folks are looking at accessing that globally"      
##  [4] "Dedalus Diggle  when was the MyLayouts functionality added"                                                                                                                    
##  [5] "Albus Dumbledore  CDR Scott  will commo off line via Outlook"                                                                                                                  
##  [6] "Luna Lovegood  belay my last irt protocol   understood   its coming"                                                                                                           
##  [7] "Albus Dumbledore   "                                                                                                                                                           
##  [8] "Ron Weasley  My layouts was added about   weeks ago"                                                                                                                           
##  [9] "Hermione Granger  thanks Ron"                                                                                                                                                  
## [10] "Dedalus Diggle  Ron     very cool"                                                                                                                                             
## [11] "Albus Dumbledore  Thank you too Hermione"


Notice that toSpace is a custom transformation we created using the content_transformer function in tm. We will use this pattern again a little later; it is a nice way to wrap any function you want to apply across the corpus. Here you can start to see significant differences: the punctuation has disappeared.
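To show the pattern more generally, here is a sketch of another custom transformer, assuming you wanted to strip URLs from the text (this is illustrative and not applied in the pipeline above):

# illustrative custom transformer: drop anything that looks like a URL
removeURL <- content_transformer(function(x) gsub("http[[:alnum:][:punct:]]*", "", x))
docs_nourl <- tm_map(docs, removeURL)  # assigned to a new object so the main pipeline is unchanged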

# convert to lower case
docs <- tm_map(docs, content_transformer(tolower))

# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
##  [1] "hermione granger  no problem rose  "                                                                                                                                           
##  [2] "fenrir greyback  thanbks   albus yes that was my thought ive heard opf a project led by the federal health archtecture agency that may provide a mechanism for the satate data"
##  [3] "albus dumbledore  unfortunately it doesnt  that product looks up one patient at a time and is stand alone  the dha pharmacy folks are looking at accessing that globally"      
##  [4] "dedalus diggle  when was the mylayouts functionality added"                                                                                                                    
##  [5] "albus dumbledore  cdr scott  will commo off line via outlook"                                                                                                                  
##  [6] "luna lovegood  belay my last irt protocol   understood   its coming"                                                                                                           
##  [7] "albus dumbledore   "                                                                                                                                                           
##  [8] "ron weasley  my layouts was added about   weeks ago"                                                                                                                           
##  [9] "hermione granger  thanks ron"                                                                                                                                                  
## [10] "dedalus diggle  ron     very cool"                                                                                                                                             
## [11] "albus dumbledore  thank you too hermione"


It is ideal to convert all the text to lower case before you start removing words. Otherwise, you would have to deal with Title Case, UPPERCASE, and lowercase variants, which means typing in each variation for it to be removed. If you drop everything to lowercase, there is no case variation and only one form of each word to remove. Now, let's proceed to remove stop words and specialized words.

# optional vector to remove additional words
rm_add_words <- c("albus", "dumbledore", "harry", "potter", "hermione", "granger", "fenrir", "greyback", "dedalus", "diggle", "luna", "lovegood", "ron", "weasley", "delacour", "fleur", "macgonagall", "mcgonagall", "arthur", "laurel", "clearwater", "yaxley", "andromeda", "cedric", "penelope", "tonks", "marvolo", "corban", "mcgonagall", "minerva")

# remove stopwords & additional words
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, rm_add_words)

# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
##  [1] "    problem rose  "                                                                                                               
##  [2] "   thanbks    yes    thought ive heard opf  project led   federal health archtecture agency  may provide  mechanism   satate data"
##  [3] "   unfortunately  doesnt   product looks  one patient   time   stand alone   dha pharmacy folks  looking  accessing  globally"    
##  [4] "      mylayouts functionality added"                                                                                              
##  [5] "   cdr scott  will commo  line via outlook"                                                                                       
##  [6] "   belay  last irt protocol   understood    coming"                                                                               
##  [7] "    "                                                                                                                             
##  [8] "    layouts  added    weeks ago"                                                                                                  
##  [9] "   thanks "                                                                                                                       
## [10] "         cool"                                                                                                                    
## [11] "   thank   "


We are almost done with normalization. There are two more things to do. The first is to eliminate the unnecessary whitespace left behind by all the word, punctuation, and number removal. The second is to manually stem a couple of words (identified from a frequency count) so that variants like 'thanks'/'thank' and 'patients'/'patient' are counted together.

# strip whitespace
docs <- tm_map(docs, stripWhitespace)

# manually stem 'thanks' to thank for more accurate counts
stemThanks <- content_transformer(function(x, pattern) {return (gsub(pattern, "thank", x))})
docs <- tm_map(docs, stemThanks, "thanks")

# manually stem 'patients' to 'patient' for more accurate counts
stemPatients <- content_transformer(function(x, pattern) {return (gsub(pattern, "patient", x))})
docs <- tm_map(docs, stemPatients, "patients")

# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
##  [1] " problem rose "                                                                                                    
##  [2] " thanbks yes thought ive heard opf project led federal health archtecture agency may provide mechanism satate data"
##  [3] " unfortunately doesnt product looks one patient time stand alone dha pharmacy folks looking accessing globally"    
##  [4] " mylayouts functionality added"                                                                                    
##  [5] " cdr scott will commo line via outlook"                                                                            
##  [6] " belay last irt protocol understood coming"                                                                        
##  [7] " "                                                                                                                 
##  [8] " layouts added weeks ago"                                                                                          
##  [9] " thank "                                                                                                           
## [10] " cool"                                                                                                             
## [11] " thank "
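As an aside, tm can also stem the entire corpus automatically via stemDocument (it relies on the SnowballC package). A minimal sketch, not applied here because it would change every word and therefore the counts in the next section:

# alternative: automated stemming across the whole corpus (requires SnowballC)
docs_stemmed <- tm_map(docs, stemDocument)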

5 Create a Document Term Matrix

Now we move to the step where we convert the corpus into a matrix of documents and terms. The name Document Term Matrix sounds very technical, but all we are really talking about is a table of numbers (a matrix) with the documents as rows and the terms (words) as columns. Each number is the count of how many times that term shows up in that document. Pretty straightforward. If you are like me, you may wonder why you would do this; the answer is that it makes it very easy to compute summary statistics and ultimately to plot them, as we will do later.
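If it helps to see this on a small scale, here is a toy illustration (separate from the pipeline above) using two made-up, one-sentence documents:

# toy example: two tiny documents, just to show the shape of the matrix
toy <- VCorpus(VectorSource(c("the cat sat", "the cat ran fast")))
toy_dtm <- DocumentTermMatrix(toy)
as.matrix(toy_dtm)  # rows = documents, columns = terms, cells = counts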

# create document term matrix and term document matrix
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)

# create document term matrix for each document
dtm_PR <- DocumentTermMatrix(docs[1])
dtm_UR <- DocumentTermMatrix(docs[2])
dtm_TT <- DocumentTermMatrix(docs[3])

# visualize the document term matrix
inspect(dtm)
## <<DocumentTermMatrix (documents: 3, terms: 1942)>>
## Non-/sparse entries: 2348/3478
## Sparsity           : 60%
## Maximal term length: 55
## Weighting          : term frequency (tf)
## Sample             :
##                           Terms
## Docs                       access hrf log nutanix password patient portal
##   TA_MHSPHP_PRESENTERS.txt      3   0   0       0        0       4      0
##   TA_MHSPHP_USERS.txt           2   0   0       0        0      36      0
##   tickets.txt                 451  91 158      72       75      43    204
##                           Terms
## Docs                       report unable user
##   TA_MHSPHP_PRESENTERS.txt      4      0    1
##   TA_MHSPHP_USERS.txt          18      1    4
##   tickets.txt                  48    466  989


Notice how the sparsity rating is 60%. What this means is that roughly 60% of the cells in the matrix are zero, i.e., the term does not appear in that document. We can drop the sparsest terms with the removeSparseTerms function; the 0.2 argument removes any term that is missing from 20% or more of the documents, which with three documents means keeping only terms that appear in all of them. You will see below that the sparsity drops to 0% and the number of terms slims down from 1,942 to 64 (the non-sparse entries go from 2,348 to 192). The most obvious consequence is that you have lost every term that had a zero count in at least one document.

# convert to sparse matrix
dtms <- removeSparseTerms(dtm, 0.2)
# visualize the document term matrix
inspect(dtms)
## <<DocumentTermMatrix (documents: 3, terms: 64)>>
## Non-/sparse entries: 192/0
## Sparsity           : 0%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##                           Terms
## Docs                       able access can patient phi report reports see
##   TA_MHSPHP_PRESENTERS.txt    1      3   9       4   2      4       1   3
##   TA_MHSPHP_USERS.txt         4      2  24      36   2     18       4  12
##   tickets.txt                49    451  27      43  35     48      48  35
##                           Terms
## Docs                       thank user
##   TA_MHSPHP_PRESENTERS.txt     2    1
##   TA_MHSPHP_USERS.txt         34    4
##   tickets.txt                  3  989

6 Create Summary Text

Now we can easily sum the columns to get the frequency count for each term. I will do this for the overall dtm and for each document-specific dtm. By the way, I created the document-specific dtms so I could create a wordcloud for each document if I wanted to.

# determine frequency of words for all documents
freq_ALL <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq_ALL, 20)
##     user   unable   access   portal      log      hrf  patient password 
##      994      467      456      204      158       91       83       75 
##  nutanix   report      get     site  account      can     able  reports 
##       72       70       69       67       66       60       54       53 
##     data    issue    reset      see 
##       52       52       51       50
# determine frequency of presenter words
freq_PR <- sort(colSums(as.matrix(dtm_PR)), decreasing=TRUE)
head(freq_PR, 20)
##       can  question    lights      make   patient    report   traffic 
##         9         5         4         4         4         4         4 
##       yes    access     added      dont      good      know       lol 
##         4         3         3         3         3         3         3 
##      meds presenter       see       thx    active       ago 
##         3         3         3         3         2         2
# determine frequency of user words
freq_UR <- sort(colSums(as.matrix(dtm_UR)), decreasing=TRUE)
head(freq_UR, 20)
##  patient    thank      can     will   report registry      yes     data 
##       36       34       24       20       18       15       15       13 
##      see    sound     time     just     like     need   opioid      run 
##       12       10        9        8        8        8        8        8 
##    email     info     look      mtf 
##        7        7        7        7


Looking at the most frequent terms in the user comments, you can see how a document with a much larger volume of terms, like the trouble tickets dataset, can skew the combined results (much along the lines of sampling and weighting in statistics). The most frequent term in the user comments ('patient', 36 occurrences) drops well down the combined ranking, and the second most frequent ('thank', 34 occurrences) does not appear in the combined top 20 at all. This is why I decided to split each document into its own document term matrix. The difference will become more pronounced when we create the wordclouds.
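One simple way to compare documents of very different sizes, if you want to go further, is to look at relative rather than raw frequencies. A quick sketch using the frequency vectors we already have:

# express counts as a share of each set's total so sizes are comparable
rel_UR  <- freq_UR / sum(freq_UR)
rel_ALL <- freq_ALL / sum(freq_ALL)
head(sort(rel_UR, decreasing = TRUE), 5)
head(sort(rel_ALL, decreasing = TRUE), 5)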

# determine frequency of trouble ticket words
freq_TT <- sort(colSums(as.matrix(dtm_TT)), decreasing=TRUE)
head(freq_TT, 20)
##     user   unable   access   portal      log      hrf password  nutanix 
##      989      466      451      204      158       91       75       72 
##  account      get     site    issue    reset     able    needs   report 
##       66       65       64       51       51       49       48       48 
##  reports requires   trying  patient 
##       48       48       48       43
# find the most frequent terms for each document in the dtm
findMostFreqTerms(dtm)
## $TA_MHSPHP_PRESENTERS.txt
##      can question   lights     make  patient   report 
##        9        5        4        4        4        4 
## 
## $TA_MHSPHP_USERS.txt
##  patient    thank      can     will   report registry 
##       36       34       24       20       18       15 
## 
## $tickets.txt
##   user unable access portal    log    hrf 
##    989    466    451    204    158     91


The last comment I'll make in this section is that the tm package's findMostFreqTerms function gives you the top terms per document directly, which is very nice.
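tm has a couple of related helpers worth knowing about. A short sketch (the thresholds are arbitrary, chosen just for illustration, and with only three documents the correlations are crude):

# terms appearing at least 50 times anywhere in the corpus
findFreqTerms(dtm, lowfreq = 50)

# terms whose counts correlate strongly with 'patient' across documents
findAssocs(dtm, "patient", corlimit = 0.9)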

7 Visualization

We are at the final step in path 1. I am generating a wordcloud for this visual, but there are many more visuals you can create. I happen to like the wordcloud but will definitely provide other examples in the future. For ideas on graphs that are possible, I highly recommend the book Text Mining with R by Julia Silge and David Robinson.

library(RColorBrewer)
library(wordcloud)
# set grid to display wordclouds as 1 row and 2 columns
par(mfrow=c(1,2))

# get word cloud of trouble tickets comments
set.seed(1234)
wordcloud(names(freq_TT), freq_TT, 
          scale = c(4, 0.5),
          min.freq = 5,
          colors = brewer.pal(6, "Dark2"))

# get word cloud of user comments
set.seed(1235)
wordcloud(names(freq_UR), freq_UR, 
          scale = c(4, 0.5),
          min.freq = 5,
          colors = brewer.pal(6, "Dark2"))
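As one more option, the wordcloud package can contrast the documents in a single plot with a comparison cloud, built from the term document matrix we created earlier. A minimal sketch:

# comparison cloud: each document gets its own color and region of the plot
comparison.cloud(as.matrix(tdm), max.words = 100, title.size = 1)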

8 Using the tidytext package - path 2

This section should be shorter because the tidytext package provides a more efficient path to summary text. I really like how tightly this package integrates with the packages developed by Hadley Wickham (tidyr, dplyr, ggplot2, etc). The path we will take: create a tidytext tibble, normalize the data, create summary text, run a sentiment analysis, and visualize the results.


The first step, as always, is to load the packages you intend to use. I will be loading readtext, tidytext, dplyr, tidyr, and ggplot2, along with the stop_words dataset.

# load tidytext
library(readtext)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
data("stop_words")

9 Creating a tidytext tibble

There are actually a couple of ways you can do this. One is to simply take the document term matrix we built earlier and convert it with the tidy function. The other option, also shown below, is to pull the raw text into a dataframe and apply the unnest_tokens function.

# convert sparse document term matrix into tidy text
docs_td <- tidy(dtms)
docs_td
## # A tibble: 192 x 3
##                    document     term count
##                       <chr>    <chr> <dbl>
##  1 TA_MHSPHP_PRESENTERS.txt     able     1
##  2 TA_MHSPHP_PRESENTERS.txt   access     3
##  3 TA_MHSPHP_PRESENTERS.txt    added     3
##  4 TA_MHSPHP_PRESENTERS.txt      ago     2
##  5 TA_MHSPHP_PRESENTERS.txt    ahlta     1
##  6 TA_MHSPHP_PRESENTERS.txt  already     1
##  7 TA_MHSPHP_PRESENTERS.txt     back     1
##  8 TA_MHSPHP_PRESENTERS.txt      can     9
##  9 TA_MHSPHP_PRESENTERS.txt     chcs     1
## 10 TA_MHSPHP_PRESENTERS.txt contains     1
## # ... with 182 more rows
# convert document dtms into tidy text
PR_tidy <- tidy(dtm_PR)
UR_tidy <- tidy(dtm_UR)
TT_tidy <- tidy(dtm_TT)

# view results from UR_tidy
head(UR_tidy)
## # A tibble: 6 x 3
##              document      term count
##                 <chr>     <chr> <dbl>
## 1 TA_MHSPHP_USERS.txt   ability     1
## 2 TA_MHSPHP_USERS.txt      able     4
## 3 TA_MHSPHP_USERS.txt     abuse     2
## 4 TA_MHSPHP_USERS.txt    access     2
## 5 TA_MHSPHP_USERS.txt accessing     1
## 6 TA_MHSPHP_USERS.txt       acg     2
# second option of creating tidytext
new_docs <- as.data.frame(readtext("~/TA-data-science-club/docs"))
tidy_docs <- new_docs %>% unnest_tokens(word, text)

10 Normalize the Data

This is where the efficiency of tidytext comes in. You need to make sure dplyr and tidyr are loaded so you can use the pipe (%>%) operator to chain your code together. In four lines of code, we create a tibble (dataset) with one token per document per row, removing stop words and the special words as we did before.

# get special word list ready for anti-join by creating dataframe out of word vector
rm_add_words <- as.data.frame(rm_add_words, stringsAsFactors = FALSE)
names(rm_add_words) <- "word"

# create tidytext of users data
tidy_docs <- new_docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(rm_add_words, by = "word")

# view first rows of tidy_docs
head(tidy_docs)
##                     doc_id    word
## 1              tickets.txt     240
## 2              tickets.txt     236
## 3 TA_MHSPHP_PRESENTERS.txt morning
## 4      TA_MHSPHP_USERS.txt morning
## 5              tickets.txt morning
## 6              tickets.txt morning

11 Create Summary Text

This portion is also very efficient. Pretty much one line of code for summary counts.

# determine most common words in data
tidy_docs %>% count(word, sort = TRUE)
## # A tibble: 2,988 x 2
##        word     n
##       <chr> <int>
##  1     user   995
##  2   unable   467
##  3   access   457
##  4   portal   204
##  5      log   158
##  6       cp   134
##  7      hrf    91
##  8 password    75
##  9     site    73
## 10  nutanix    72
## # ... with 2,978 more rows
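Because we are already working with a tidy data frame, it only takes a few more lines to turn those counts into a plot. A sketch using ggplot2 (the cutoff of 100 occurrences is arbitrary):

# bar chart of the most common words across all documents
tidy_docs %>%
  count(word, sort = TRUE) %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip()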

12 Sentiment Analysis

The first step of sentiment analysis is to understand which sentiment lexicons are available through tidytext. I list them out so you can see the different approaches they take. Some simply classify words as positive or negative, others map words to different emotions (anger, fear, joy, and so on), and one (afinn) assigns each word a numeric score indicating how negative or positive it is.

# explore sentiments
afinn <- get_sentiments(lexicon = "afinn")
bing <- get_sentiments(lexicon = "bing")
nrc <- get_sentiments(lexicon = "nrc")
loughran <- get_sentiments(lexicon = "loughran")

# list out lexicons
afinn
## # A tibble: 2,476 x 2
##          word score
##         <chr> <int>
##  1    abandon    -2
##  2  abandoned    -2
##  3   abandons    -2
##  4   abducted    -2
##  5  abduction    -2
##  6 abductions    -2
##  7      abhor    -3
##  8   abhorred    -3
##  9  abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows
bing
## # A tibble: 6,788 x 2
##           word sentiment
##          <chr>     <chr>
##  1     2-faced  negative
##  2     2-faces  negative
##  3          a+  positive
##  4    abnormal  negative
##  5     abolish  negative
##  6  abominable  negative
##  7  abominably  negative
##  8   abominate  negative
##  9 abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows
nrc
## # A tibble: 13,901 x 2
##           word sentiment
##          <chr>     <chr>
##  1      abacus     trust
##  2     abandon      fear
##  3     abandon  negative
##  4     abandon   sadness
##  5   abandoned     anger
##  6   abandoned      fear
##  7   abandoned  negative
##  8   abandoned   sadness
##  9 abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows
loughran
## # A tibble: 4,149 x 2
##            word sentiment
##           <chr>     <chr>
##  1      abandon  negative
##  2    abandoned  negative
##  3   abandoning  negative
##  4  abandonment  negative
##  5 abandonments  negative
##  6     abandons  negative
##  7    abdicated  negative
##  8    abdicates  negative
##  9   abdicating  negative
## 10   abdication  negative
## # ... with 4,139 more rows


There is great flexibility with this package to create specific sentiment filters. For example, what if I wanted to look for the sentiment of 'joy' expressed by the users of our information system in their comments? To do this, you create a list of 'joy' words by filtering the nrc lexicon for 'joy', and then do an inner join with the tidy_docs data on the shared word column. You will need the dplyr package for the join.

# explore joy sentiment for user comments
nrcjoy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

# filter docs for MHSPHP user comments & look for joy words
tidy_docs %>%
  filter(doc_id == "TA_MHSPHP_USERS.txt") %>%
  inner_join(nrcjoy, by = "word") %>%
  count(word, sort = TRUE)
## # A tibble: 13 x 2
##        word     n
##       <chr> <int>
##  1  helpful     3
##  2  advance     2
##  3     cash     2
##  4   create     2
##  5    green     2
##  6     love     2
##  7      pay     2
##  8  charity     1
##  9 friendly     1
## 10    music     1
## 11  perfect     1
## 12     save     1
## 13    score     1

13 Visualization

Finally, for this particular project, we will visualize the balance of positive and negative words with a basic bar chart in ggplot2. The midline at 0 is neutral; anything below it is net negative and anything above it is net positive. It comes as no surprise that the trouble tickets data had an overwhelmingly negative sentiment, while the other two documents leaned positive.

library(tidyr)
comment_docs <- tidy_docs %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(word, index = doc_id, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

# graph sentiments of comments
library(ggplot2)
ggplot(comment_docs, aes(index, sentiment, fill = index)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~index, ncol = 3, scales = "free_x")
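If you want a single number per document to back up that impression, a quick summary of the same data frame works (a sketch, not part of the original analysis):

# net sentiment (positive minus negative word counts) per document
comment_docs %>%
  group_by(index) %>%
  summarise(net_sentiment = sum(sentiment))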

14 Conclusion

This was a bit of a long post, so congratulations if you made it this far. However, we have only scratched the surface of what is possible. My code is available via github if you want to simply run it, and the sample datasets are available as well. There is so much more that can be done, such as topic models, correlational scatterplots, sentiment clouds, and more. I'll probably cover these in a future blog post. Thanks again for visiting, and feel free to contact me or leave a comment if you have any feedback on this project.