Text Analysis with R
1 Introduction
Text analysis is akin to one of those Choose Your Own Adventure stories where there are many paths to the end. With R, there are many packages (tm, tidytext, tidyr, dplyr, ggplot2, quanteda, wordcloud) one could use when analyzing text. In this report, I will cover two paths, which I will creatively refer to as path 1 and path 2.
2 Using the tm package - path 1
For the first part of the analysis, we will take the following route: create a corpus, normalize the data, build a document term matrix, create summary text, and visualize the results.
As a point of order, the code to do this work will be in this github repository. For the report's sake, it will not include all the repetitive output you see here.
3 Creating a Corpus
I like to think of the corpus as a folder of documents. It is the container that holds the text, which lives inside the individual documents. The corpus carries metadata and stores the text in a form that makes the conversion to a Term Document Matrix or Document Term Matrix seamless. We will need to start by loading the tm package.
# load tm library
library(tm)
What I like about the corpus functions is that, combined with DirSource, they allow you to pull all the documents in at the same time. This command creates a volatile corpus (tm also offers SimpleCorpus for a simple corpus and PCorpus for a permanent one). The three documents in this corpus are listed below.
# create a corpus
docs <- VCorpus(DirSource("~/TA-data-science-club/docs"))
summary(docs)
## Length Class Mode
## TA_MHSPHP_PRESENTERS.txt 2 PlainTextDocument list
## TA_MHSPHP_USERS.txt 2 PlainTextDocument list
## tickets.txt 2 PlainTextDocument list
These documents represent text data from a chat box for a presentation on one of our information systems. The 'users' comments were from users who were present during a demonstration, and the 'presenter' comments were from a private chat box used by the presenters during the same demonstration. The other document contains the descriptive comments for several years of trouble tickets for our information systems. I picked these because they were readily available and because I needed to do a text analysis of the trouble tickets anyway. You would expect the two chat box documents to be very similar and the trouble tickets to be significantly different.
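As an aside, if you do not need the full flexibility of a volatile corpus, a simple corpus can be created the same way. A minimal sketch, not used in the rest of this report:
# hypothetical alternative: a simple corpus from the same directory
simple_docs <- SimpleCorpus(DirSource("~/TA-data-science-club/docs"))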
Now let’s take a look at the data that was uploaded.
# demonstrate lines in each document
library(readtext)
doc1 <- readLines("/Users/davidcarnahan/TA-data-science-club/docs/TA_MHSPHP_USERS.txt") # readLines() has no skip argument, so we read the whole file and index the lines we want below
doc1[128:138]
## [1] "Hermione Granger: No problem Rose :)"
## [2] "Fenrir Greyback: Thanbks - Albus, yes, that was my thought. I've heard opf a project led by the Federal Health Archtecture agency that may provide a mechanism for the satate data."
## [3] "Albus Dumbledore: Unfortunately it doesnt. That product looks up one patient at a time and is stand alone. The DHA Pharmacy folks are looking at accessing that globally."
## [4] "Dedalus Diggle: when was the MyLayouts functionality added?"
## [5] "Albus Dumbledore: CDR Scott--will commo off line via Outlook."
## [6] "Luna Lovegood: belay my last irt protocol - understood - it's coming"
## [7] "Albus Dumbledore: :)"
## [8] "Ron Weasley: My layouts was added about 2-3 weeks ago"
## [9] "Hermione Granger: thanks Ron!"
## [10] "Dedalus Diggle: @Ron --- very cool!"
## [11] "Albus Dumbledore: Thank you too Hermione!"
Based on this text, you can see that I modified the names of the participants in order to protect the innocent. The important thing to note here is that you never know what you are going to get with text. So, you will want to be careful if you are doing text analysis on healthcare data, given the HIPAA challenges with names and other personally identifiable information.
4 Normalize the Data
This is where we will use the tm_map functions to remove numbers, space, punctuation, stop words, and more. We will begin with removing numbers.
# remove numbers
docs <- tm_map(docs, removeNumbers)
# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
## [1] "Hermione Granger: No problem Rose :)"
## [2] "Fenrir Greyback: Thanbks - Albus, yes, that was my thought. I've heard opf a project led by the Federal Health Archtecture agency that may provide a mechanism for the satate data."
## [3] "Albus Dumbledore: Unfortunately it doesnt. That product looks up one patient at a time and is stand alone. The DHA Pharmacy folks are looking at accessing that globally."
## [4] "Dedalus Diggle: when was the MyLayouts functionality added?"
## [5] "Albus Dumbledore: CDR Scott--will commo off line via Outlook."
## [6] "Luna Lovegood: belay my last irt protocol - understood - it's coming"
## [7] "Albus Dumbledore: :)"
## [8] "Ron Weasley: My layouts was added about - weeks ago"
## [9] "Hermione Granger: thanks Ron!"
## [10] "Dedalus Diggle: @Ron --- very cool!"
## [11] "Albus Dumbledore: Thank you too Hermione!"
You really shouldn't see much difference between this output and the one above because there were only a few numbers in the text. In line [8], '2-3 weeks' was reduced to '- weeks'. Nevertheless, it is important to remove numbers in case you do have them in the text. Now let's add a space so we can then remove punctuation. You will see that the colon is snug up against the name, which is why you need to add the space before removing the punctuation.
# create toSpace function
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
# add space around dashes and colons
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
# remove punctuation & special character
docs <- tm_map(docs, removePunctuation)
# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
## [1] "Hermione Granger No problem Rose "
## [2] "Fenrir Greyback Thanbks Albus yes that was my thought Ive heard opf a project led by the Federal Health Archtecture agency that may provide a mechanism for the satate data"
## [3] "Albus Dumbledore Unfortunately it doesnt That product looks up one patient at a time and is stand alone The DHA Pharmacy folks are looking at accessing that globally"
## [4] "Dedalus Diggle when was the MyLayouts functionality added"
## [5] "Albus Dumbledore CDR Scott will commo off line via Outlook"
## [6] "Luna Lovegood belay my last irt protocol understood its coming"
## [7] "Albus Dumbledore "
## [8] "Ron Weasley My layouts was added about weeks ago"
## [9] "Hermione Granger thanks Ron"
## [10] "Dedalus Diggle Ron very cool"
## [11] "Albus Dumbledore Thank you too Hermione"
You should notice that toSpace is a custom transformation we created by wrapping a plain function with the content_transformer function in tm. This pattern will be used again a little later; it is a nice way to wrap any function you want to apply across the corpus. Here you can start to see significant differences: the punctuation has disappeared.
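To show how reusable this pattern is, here is a hypothetical transformer (not used in this analysis) that replaces URLs with a space in the same way:
# hypothetical example: a reusable transformer that replaces URLs with a space
removeURL <- content_transformer(function(x) gsub("http\\S+", " ", x))
# it would be applied just like toSpace: docs <- tm_map(docs, removeURL)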
# convert to lower case
docs <- tm_map(docs, content_transformer(tolower))
# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
## [1] "hermione granger no problem rose "
## [2] "fenrir greyback thanbks albus yes that was my thought ive heard opf a project led by the federal health archtecture agency that may provide a mechanism for the satate data"
## [3] "albus dumbledore unfortunately it doesnt that product looks up one patient at a time and is stand alone the dha pharmacy folks are looking at accessing that globally"
## [4] "dedalus diggle when was the mylayouts functionality added"
## [5] "albus dumbledore cdr scott will commo off line via outlook"
## [6] "luna lovegood belay my last irt protocol understood its coming"
## [7] "albus dumbledore "
## [8] "ron weasley my layouts was added about weeks ago"
## [9] "hermione granger thanks ron"
## [10] "dedalus diggle ron very cool"
## [11] "albus dumbledore thank you too hermione"
It is ideal to convert all the text to lower case before you start removing words. Otherwise, you will have to deal with Title Case, UPPERCASE, and lowercase, which means you would need to type in each variation for it to be removed. If you drop everything to lowercase, there is no case variation and only one form of each word to remove. Now, let's proceed to remove stop words and specialized words.
# optional vector to remove additional words
rm_add_words <- c("albus", "dumbledore", "harry", "potter", "hermione", "granger", "fenrir", "greyback", "dedalus", "diggle", "luna", "lovegood", "ron", "weasley", "delacour", "fleur", "macgonagall", "mcgonagall", "arthur", "laurel", "clearwater", "yaxley", "andromeda", "cedric", "penelope", "tonks", "marvolo", "corban", "mcgonagall", "minerva")
# remove stopwords & additional words
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, rm_add_words)
# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
## [1] " problem rose "
## [2] " thanbks yes thought ive heard opf project led federal health archtecture agency may provide mechanism satate data"
## [3] " unfortunately doesnt product looks one patient time stand alone dha pharmacy folks looking accessing globally"
## [4] " mylayouts functionality added"
## [5] " cdr scott will commo line via outlook"
## [6] " belay last irt protocol understood coming"
## [7] " "
## [8] " layouts added weeks ago"
## [9] " thanks "
## [10] " cool"
## [11] " thank "
We are almost done with normalization. There are two more things we need to do. The first is to eliminate the unnecessary white space we now have because of all the word, punctuation, and number removal. The second is to manually stem a couple of words ('thanks' and 'patients') so that their variants are counted together when we compute word frequencies.
# strip whitespace
docs <- tm_map(docs, stripWhitespace)
# manually stem 'thanks' to thank for more accurate counts
stemThanks <- content_transformer(function(x, pattern) {return (gsub(pattern, "thank", x))})
docs <- tm_map(docs, stemThanks, "thanks")
# manually stem 'patients' to 'patient' for more accurate counts
stemPatients <- content_transformer(function(x, pattern) {return (gsub(pattern, "patient", x))})
docs <- tm_map(docs, stemPatients, "patients")
# print results for comparison
doc1 <- docs[[2]]$content
doc1[128:138]
## [1] " problem rose "
## [2] " thanbks yes thought ive heard opf project led federal health archtecture agency may provide mechanism satate data"
## [3] " unfortunately doesnt product looks one patient time stand alone dha pharmacy folks looking accessing globally"
## [4] " mylayouts functionality added"
## [5] " cdr scott will commo line via outlook"
## [6] " belay last irt protocol understood coming"
## [7] " "
## [8] " layouts added weeks ago"
## [9] " thank "
## [10] " cool"
## [11] " thank "
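Manually stemming works for a couple of words, but if you want to stem everything, tm wraps the Porter stemmer from the SnowballC package. A sketch, assigned to a new object here so it does not change the output that follows:
# optional: stem every word with the Porter stemmer (requires the SnowballC package)
library(SnowballC)
docs_stemmed <- tm_map(docs, stemDocument)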
5 Create a Document Term Matrix
Now we move to the step where we convert the corpus into a matrix of documents and terms. The name Document Term Matrix sounds very technical, but all we are really talking about is a table of numbers (a matrix) with the documents as rows and the terms (words) as columns. Each number in the matrix is the count of times that term shows up in that document. Pretty straightforward. If you are like me, you may be wondering why you would do this; the answer is that it makes it very easy to compute summary statistics and, ultimately, to plot them, as we will do later.
# create document term matrix and term document matrix
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
# create document term matrix for each document
dtm_PR <- DocumentTermMatrix(docs[1])
dtm_UR <- DocumentTermMatrix(docs[2])
dtm_TT <- DocumentTermMatrix(docs[3])
# visualize the document term matrix
inspect(dtm)
## <<DocumentTermMatrix (documents: 3, terms: 1942)>>
## Non-/sparse entries: 2348/3478
## Sparsity : 60%
## Maximal term length: 55
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs access hrf log nutanix password patient portal
## TA_MHSPHP_PRESENTERS.txt 3 0 0 0 0 4 0
## TA_MHSPHP_USERS.txt 2 0 0 0 0 36 0
## tickets.txt 451 91 158 72 75 43 204
## Terms
## Docs report unable user
## TA_MHSPHP_PRESENTERS.txt 4 0 1
## TA_MHSPHP_USERS.txt 18 1 4
## tickets.txt 48 466 989
Notice how the sparsity rating is 60%. This means that 60% of the cells in the matrix are zero; most terms do not appear in every document. We can drop the sparsest terms with the removeSparseTerms function. You will see below that the sparsity flips to 0% and the number of terms slims down from 1942 to 64 (the non-sparse entries drop from 2348 to 192). In effect, you have lost every term that had a zero count in at least one document.
# convert to sparse matrix
dtms <- removeSparseTerms(dtm, 0.2)
# visualize the document term matrix
inspect(dtms)
## <<DocumentTermMatrix (documents: 3, terms: 64)>>
## Non-/sparse entries: 192/0
## Sparsity : 0%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs able access can patient phi report reports see
## TA_MHSPHP_PRESENTERS.txt 1 3 9 4 2 4 1 3
## TA_MHSPHP_USERS.txt 4 2 24 36 2 18 4 12
## tickets.txt 49 451 27 43 35 48 48 35
## Terms
## Docs thank user
## TA_MHSPHP_PRESENTERS.txt 2 1
## TA_MHSPHP_USERS.txt 34 4
## tickets.txt 3 989
6 Create Summary Text
Now we can easily proceed to sum the columns to see the frequency count for each term. I will do this for the overall dtm and for each document-specific dtm. By the way, I created the document-specific dtms so I could create a wordcloud for each document if I wanted to.
# determine frequency of words for all documents
freq_ALL <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq_ALL, 20)
## user unable access portal log hrf patient password
## 994 467 456 204 158 91 83 75
## nutanix report get site account can able reports
## 72 70 69 67 66 60 54 53
## data issue reset see
## 52 52 51 50
# determine frequency of presenter words
freq_PR <- sort(colSums(as.matrix(dtm_PR)), decreasing=TRUE)
head(freq_PR, 20)
## can question lights make patient report traffic
## 9 5 4 4 4 4 4
## yes access added dont good know lol
## 4 3 3 3 3 3 3
## meds presenter see thx active ago
## 3 3 3 3 2 2
# determine frequency of user words
freq_UR <- sort(colSums(as.matrix(dtm_UR)), decreasing=TRUE)
head(freq_UR, 20)
## patient thank can will report registry yes data
## 36 34 24 20 18 15 15 13
## see sound time just like need opioid run
## 12 10 9 8 8 8 8 8
## email info look mtf
## 7 7 7 7
Looking at the most frequent terms in the user comments, you can see how a document with a much larger volume of terms, like the trouble tickets dataset, can skew the combined results (much along the lines of sampling and weighting in statistics). The most frequent term in the user comments ('patient') is not even in the top 20 of the combined word frequencies. This is why I decided to split each document into its own document term matrix. The difference will become more profound when we create the wordclouds.
# determine frequency of trouble ticket words
freq_TT <- sort(colSums(as.matrix(dtm_TT)), decreasing=TRUE)
head(freq_TT, 20)
## user unable access portal log hrf password nutanix
## 989 466 451 204 158 91 75 72
## account get site issue reset able needs report
## 66 65 64 51 51 49 48 48
## reports requires trying patient
## 48 48 48 43
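As an aside, another way to deal with the size imbalance noted above is to compare relative frequencies instead of raw counts, or to build a tf-idf weighted matrix (tm's weightTfIdf normalizes by document length and down-weights terms that appear in every document). A sketch of both, not used in the rest of this report:
# relative frequencies put documents of different sizes on the same scale
prop_UR <- freq_UR / sum(freq_UR)
prop_TT <- freq_TT / sum(freq_TT)
head(sort(prop_TT, decreasing = TRUE), 5)
# a tf-idf weighted document term matrix
dtm_tfidf <- DocumentTermMatrix(docs, control = list(weighting = weightTfIdf))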
# find the most frequent terms per document (n = 6 is the default)
findMostFreqTerms(dtm, n = 6)
## $TA_MHSPHP_PRESENTERS.txt
## can question lights make patient report
## 9 5 4 4 4 4
##
## $TA_MHSPHP_USERS.txt
## patient thank can will report registry
## 36 34 24 20 18 15
##
## $tickets.txt
## user unable access portal log hrf
## 989 466 451 204 158 91
The last comment I'll make in this section is that the tm package has a handy findMostFreqTerms function, which returns the top n terms for each document (six by default), as shown above.
7 Visualization
We are at the final step in path 1. I am generating a wordcloud for this visual, but there are many more visuals you can create. I happen to like the wordcloud but will definitely provide other examples in the future. For ideas on the graphs that are possible, I highly recommend the book Text Mining with R by Julia Silge and David Robinson.
library(RColorBrewer)
library(wordcloud)
# set grid to display wordclouds as 1 row and 2 columns
par(mfrow=c(1,2))
# get word cloud of trouble tickets comments
set.seed(1234)
wordcloud(names(freq_TT), freq_TT,
scale = c(4, 0.5),
min.freq = 5,
colors = brewer.pal(6, "Dark2"))
# get word cloud of user comments
set.seed(1235)
wordcloud(names(freq_UR), freq_UR,
scale = c(4, 0.5),
min.freq = 5,
colors = brewer.pal(6, "Dark2"))
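If you want the documents contrasted in a single plot, the wordcloud package also provides comparison.cloud, which takes a term document matrix. A sketch (the short column labels assume the documents come back in the order shown by summary(docs) earlier):
# contrast the three documents in one comparison cloud
tdm_m <- as.matrix(tdm)
colnames(tdm_m) <- c("presenters", "users", "tickets")
set.seed(1234)
comparison.cloud(tdm_m, scale = c(3, 0.3), max.words = 100, title.size = 1)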
1 Using the tidytext package - path 2
This section should be shorter because the tidytext package provides a more efficient path to summary text. I really like how tightly this package integrates with the packages developed by Hadley Wickham (tidyr, dplyr, ggplot2, etc.). The path we will take is: create a tidytext tibble, normalize the data, create summary text, run a sentiment analysis, and visualize the results.
The first step, as always, is to load the packages you intend to use. I will be loading readtext, tidytext, dplyr, tidyr, and ggplot2, along with the stop_words data.
# load tidytext
library(readtext)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
data("stop_words")
2 Creating a tidytext tibble
There are actually a couple of ways you can do this. One is to simply take the document term matrix we built earlier and convert it with the tidy function. The other option, shown second below, is to pull the raw data into a data frame and apply the unnest_tokens function.
# convert sparse document term matrix into tidy text
docs_td <- tidy(dtms)
docs_td
## # A tibble: 192 x 3
## document term count
## <chr> <chr> <dbl>
## 1 TA_MHSPHP_PRESENTERS.txt able 1
## 2 TA_MHSPHP_PRESENTERS.txt access 3
## 3 TA_MHSPHP_PRESENTERS.txt added 3
## 4 TA_MHSPHP_PRESENTERS.txt ago 2
## 5 TA_MHSPHP_PRESENTERS.txt ahlta 1
## 6 TA_MHSPHP_PRESENTERS.txt already 1
## 7 TA_MHSPHP_PRESENTERS.txt back 1
## 8 TA_MHSPHP_PRESENTERS.txt can 9
## 9 TA_MHSPHP_PRESENTERS.txt chcs 1
## 10 TA_MHSPHP_PRESENTERS.txt contains 1
## # ... with 182 more rows
# convert document dtms into tidy text
PR_tidy <- tidy(dtm_PR)
UR_tidy <- tidy(dtm_UR)
TT_tidy <- tidy(dtm_TT)
# view results from UR_tidy
head(UR_tidy)
## # A tibble: 6 x 3
## document term count
## <chr> <chr> <dbl>
## 1 TA_MHSPHP_USERS.txt ability 1
## 2 TA_MHSPHP_USERS.txt able 4
## 3 TA_MHSPHP_USERS.txt abuse 2
## 4 TA_MHSPHP_USERS.txt access 2
## 5 TA_MHSPHP_USERS.txt accessing 1
## 6 TA_MHSPHP_USERS.txt acg 2
# second option of creating tidytext
new_docs <- as.data.frame(readtext("~/TA-data-science-club/docs"))
tidy_docs <- new_docs %>% unnest_tokens(word, text)
3 Normalize the data
This is where the efficiency of tidytext comes in. You need to make sure dplyr and tidyr are loaded so you can use the pipe (%>%) operator to chain your code together. In four lines of code, we create a tibble (dataset) with one word per document per row, and then remove the stop words and special words just as we did before.
# get special word list ready for anti-join by creating dataframe out of word vector
rm_add_words <- as.data.frame(rm_add_words, stringsAsFactors = FALSE)
names(rm_add_words) <- "word"
# create tidytext of users data
tidy_docs <- new_docs %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
anti_join(rm_add_words, by = "word")
# view first rows of tidy_docs
head(tidy_docs)
## doc_id word
## 1 tickets.txt 240
## 2 tickets.txt 236
## 3 TA_MHSPHP_PRESENTERS.txt morning
## 4 TA_MHSPHP_USERS.txt morning
## 5 tickets.txt morning
## 6 tickets.txt morning
4 Create Summary Text
This portion is also very efficient: pretty much one line of code gets you the summary counts.
# determine most common words in data
tidy_docs %>% count(word, sort = TRUE)
## # A tibble: 2,988 x 2
## word n
## <chr> <int>
## 1 user 995
## 2 unable 467
## 3 access 457
## 4 portal 204
## 5 log 158
## 6 cp 134
## 7 hrf 91
## 8 password 75
## 9 site 73
## 10 nutanix 72
## # ... with 2,978 more rows
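Since everything is already in a tidy format, a few more lines turn those counts into a quick bar chart with ggplot2 (the n > 100 cutoff is an arbitrary choice for illustration):
# bar chart of the most common words across all documents
tidy_docs %>%
  count(word, sort = TRUE) %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip()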
5 Sentiment Analysis
The first step of sentiment analysis is to understand what sentiment lexicons are available through tidytext. I list them out below so you can see the different approaches they take. Some simply label each word as positive or negative, others map words to different emotions, and one (AFINN) assigns a numeric score for how negative or positive a word is.
# explore sentiments
afinn <- get_sentiments(lexicon = "afinn")
bing <- get_sentiments(lexicon = "bing")
nrc <- get_sentiments(lexicon = "nrc")
loughran <- get_sentiments(lexicon = "loughran")
# list out lexicons
afinn
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
bing
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
nrc
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
loughran
## # A tibble: 4,149 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # ... with 4,139 more rows
There is great flexibility in this package for creating specific sentiment filters. For example, suppose I wanted to look at the sentiment of 'joy' expressed by the users of our information system in their comments; what would that look like? You create a list of 'joy' words by filtering the lexicon for 'joy', and then do an inner join with the tidy_docs tibble on the common words. You will need the dplyr package to do the join.
# explore joy sentiment for user comments
nrcjoy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
# filter docs for MHSPHP user comments & look for joy words
tidy_docs %>%
filter(doc_id == "TA_MHSPHP_USERS.txt") %>%
inner_join(nrcjoy, by = "word") %>%
count(word, sort = TRUE)
## # A tibble: 13 x 2
## word n
## <chr> <int>
## 1 helpful 3
## 2 advance 2
## 3 cash 2
## 4 create 2
## 5 green 2
## 6 love 2
## 7 pay 2
## 8 charity 1
## 9 friendly 1
## 10 music 1
## 11 perfect 1
## 12 save 1
## 13 score 1
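You could take a similar approach with the AFINN lexicon to boil each document down to a single net score. A sketch (the score column name matches the version of the lexicon printed earlier):
# net AFINN sentiment score per document
tidy_docs %>%
  inner_join(afinn, by = "word") %>%
  group_by(doc_id) %>%
  summarise(net_score = sum(score))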
6 Visualization
Finally, for this particular project, we will visualize the volume of positive and negative words by graphing a basic bar chart with ggplot2. The line at 0 is neutral: anything below it is net negative, and anything above it is net positive. It comes as no surprise that the trouble tickets data had an overwhelmingly negative sentiment, while the other two documents were positive.
library(tidyr)
comment_docs <- tidy_docs %>%
inner_join(get_sentiments("nrc"), by = "word") %>%
count(word, index = doc_id, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
# graph sentiments of comments
library(ggplot2)
ggplot(comment_docs, aes(index, sentiment, fill = index)) +
geom_col(show.legend = FALSE) +
facet_wrap(~index, ncol = 3, scales = "free_x")
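If you want to see which words drive those totals, a common follow-up (along the lines of the Text Mining with R book, using the bing lexicon here) is to plot the top contributors to each sentiment:
# words contributing most to positive and negative sentiment
tidy_docs %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip()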
7 Conclusion
This was a bit of a long post, so congratulations if you made it this far. Still, we have only scratched the surface of what is possible. My code is available via github if you want to simply run it, and the sample datasets are available as well. There is so much more that can be done, such as topic models, correlational scatterplots, sentiment clouds, and more. I'll probably cover these in a future blog post. Thanks again for visiting, and feel free to contact me or leave a comment if you have any feedback on this project.