Text processing

In this lab you’ll learn some basic text processing following what was presented in class and do a little exploratory analysis using token frequency measures.

Objectives

Action

Setup for lab

Open RStudio.

  1. We’ll need a few additional packages that weren’t installed in the first lab. In the console, execute the following commands:
url <- 'https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/scripts/package-installs.R'

source(url)
  2. Create a new script in your lab directory and copy-paste the code chunk below. Execute once.
# setup
library(tidyverse)
library(tidytext)
library(tokenizers)
library(textstem)
library(stopwords)
url <- 'https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/labs/lab5-text/data/drseuss.txt'

# read data
seuss_lines <- read_lines(url, skip_empty_rows = TRUE)

The text we’ll work with comprises four Dr. Seuss books. The raw data are read in line-by-line, so that seuss_lines is a vector in which each element is a line from one of the four books. Lines are rendered in order.

seuss_lines %>% head()
[1] "The Cat in the Hat"            "By Dr. Seuss"                 
[3] "The sun did not shine."        "It was too wet to play."      
[5] "So we sat in the house"        "All that cold, cold, wet day."

Text preprocessing

For us, ‘preprocessing’ operations will refer to coercing a document into one long uniformly-formatted string.

Distinguishing documents

To start, we have all four books lumped together. A quick visual scan of the text file will confirm that each book is set off by the title on one line followed by ‘By Dr. Seuss’ on the next line.

We can leverage this structure to distinguish the books: the chunk below

  • creates a ‘flag’ by pattern-matching each line with ‘Dr. Seuss’, which marks the author lines,

  • then shifts the lines down by one so that the flag matches the title line instead of the author line,

  • then assigns a document ID to each line by computing the cumulative number of flags up to and including that line.

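The last two commands correct for having ‘lagged’ the lines: slice(-1) drops the empty first row that the lag introduces, and fill(doc) fills in the document ID that is missing for the final line.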

# flag lines with a document id
seuss_lines_df <- tibble(line_lag = c(seuss_lines, NA)) %>%
  mutate(flag = str_detect(line_lag, fixed('Dr. Seuss')), # match literally; '.' is a regex wildcard
         line = lag(line_lag, n = 1),
         doc = cumsum(flag)) %>% 
  select(doc, line) %>%
  slice(-1) %>%
  fill(doc)
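To see the shift-and-count idea in isolation, here is a minimal sketch on a made-up vector of lines. It uses base R (grepl() and cumsum()) rather than the dplyr pipeline above, and the toy lines and variable names are invented for illustration.

```r
# toy input: two made-up "books", each headed by a title line and an author line
toy_lines <- c("Book A", "By Dr. Seuss", "line one", "line two",
               "Book B", "By Dr. Seuss", "line three")

# TRUE wherever the line is an author line
author_flag <- grepl("Dr. Seuss", toy_lines, fixed = TRUE)

# shift the flag up one position so that it marks the title line instead
title_flag <- c(author_flag[-1], FALSE)

# running count of title lines seen so far = document id
doc <- cumsum(title_flag)

data.frame(doc = doc, line = toy_lines)
```

The first four lines get ID 1 and the last three get ID 2, so each title line opens a new document.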

We may as well assign labels to the document IDs.

# grab titles
titles <- seuss_lines_df %>% 
  group_by(doc) %>%
  slice_head() %>%
  pull(line) %>%
  tolower()

# label docs
seuss_lines_df <- seuss_lines_df %>%
  mutate(doc = factor(doc, labels = titles))

Finally, we’ll strip the title and author information, because all books are by the same author and the title is now recorded in the document ID.

The chunk below adds a document-specific line number and removes the first two lines from every document. Since each row is a line, this amounts to a simple row numbering and filtering.

# remove header lines (title/author)
seuss_lines_clean <- seuss_lines_df %>%
  group_by(doc) %>%
  mutate(line_num = row_number() - 2) %>%
  filter(line_num > 0)

Action

Line summaries

See if you can answer the following questions:

  1. How many lines are in each book?
  2. How many lines in each book contain the word ‘bump’?

Work with a neighbor. Hint: you might find it handy to use str_detect(), along with grouped operations and/or summaries.

Collapsing lines and cleaning text

First, concatenate all the lines in each book using str_c().

# collapse lines into one long string
seuss_text <- seuss_lines_clean %>% 
  summarize(text = str_c(line, collapse = ' '))

In this case the resulting text strings for each document don’t contain too many elements in need of removal: just punctuation and capital letters.

cat_in_hat <- seuss_text %>% slice(1) %>% pull(text)

To strip these elements, we can remove every character that matches a punctuation pattern and then use tolower() to replace upper-case letters with lower-case ones. The POSIX character class '[[:punct:]]', which stringr regular expressions support, is shorthand for punctuation characters.

cat_in_hat %>%
  str_remove_all('[[:punct:]]') %>%
  tolower()
[1] "the sun did not shine it was too wet to play so we sat in the house all that cold cold wet day i sat there with sally we sat there we two and i said how i wish we had something to do too wet to go out and too cold to play ball so we sat in the house we did nothing at all so all we could do was to sit sit sit sit and we did not like it not one little bit bump and then something went bump how that bump made us jump we looked then we saw him step in on the mat we looked and we saw him the cat in the hat and he said to us why do you sit there like that i know it is wet and the sun is not sunny but we can have lots of good fun that is funny i know some good games we could play said the cat i know some new tricks said the cat in the hat a lot of good tricks i will show them to you your mother will not mind at all if i do then sally and i did not know what to say our mother was out of the house for the day but our fish said no no make that cat go away tell that cat in the hat you do not want to play he should not be here he should not be about he should not be here when your mother is out now now have no fear have no fear said the cat my tricks are not bad said the cat in the hat why we can have lots of good fun if you wish with a game that i call upupup with a fish put me down said the fish this is no fun at all put me down said the fish i do not wish to fall have no fear said the cat i will not let you fall i will hold you up high as i stand on a ball with a book on one hand and a cup on my hat but that is not all i can do said the cat look at me look at me now said the cat with a cup and a cake on the top of my hat i can hold up two books i can hold up the fish and a litte toy ship and some milk on a dish and look i can hop up and down on the ball but that is not all oh no that is not all look at me look at me look at me now it is fun to have fun but you have to know how i can hold up the cup and the milk and the cake i can hold up these books and the fish on a 
rake i can hold the toy ship and a little toy man and look with my tail i can hold a red fan i can fan with the fan as i hop on the ball but that is not all oh no that is not all that is what the cat said then he fell on his head he came down with a bump from up there on the ball and sally and i we saw all the things fall and our fish came down too he fell into a pot he said do i like this oh no i do not this is not a good game said our fish as he lit no i do not like it not one little bit now look what you did said the fish to the cat now look at this house look at this look at that you sank our toy ship sank it deep in the cake you shook up our house and you bent our new rake you should not be here when our mother is not you get out of this house said the fish in the pot but i like to be here oh i like it a lot said the cat in the hat to the fish in the pot i will not go away i do not wish to go and so said the cat in the hat so so so i will show you another good game that i know and then he ran out and then fast as a fox the cat in the hat came back in with a box a big red wood box it was shut with a hook now look at this trick said the cat take a look then he got up on top with a tip of his hat i call this game funinabox said the cat in this box are two things i will show to you now you will like these two things said the cat with a bow i will pick up the hook you will see something new two things and i call them thing one and thing two these things will not bite you they want to have fun then out of the box came thing two and thing one and they ran to us fast they said how do you do would you like to shake hands with thing one and thing two and sally and i did not know what to do so we had to shake hands with thing one and thing two we shook their two hands but our fish said no no those things should not be in this house make them go they should not be here when your mother is not put them out put them out said the fish in the pot have no fear little fish said 
the cat in the hat these things are good things and he gave them a pat they are tame oh so tame they have come here to play they will give you some fun on this wet wet wet day now here is a game that they like said the cat they like to fly kites said the cat in the hat no not in the house said the fish in the pot they should not fly kites in a house they should not oh the things they will bump oh the things they will hit oh i do not like it not one little bit then sally and i saw them run down the hall we saw those two things bump their kites on the wall bump thump thump bump down the wall in the hall thing two and thing one they ran up they ran down on the string of one kite we saw mothers new gown her gown with the dots that are pink white and red then we saw one kite bump on the head of her bed then those things ran about with big bumps jumps and kicks and with hops and big thumps and all kinds of bad tricks and i said i do not like the way that they play if mother could see this oh what would she say then our fish said look look and our fish shook with fear your mother is on her way home do you hear oh what will she do to us what will she say oh she will not like it to find us this way so do something fast said the fish do you hear i saw her your mother your mother is near so as fast as you can think of something to do you will have to get rid of thing one and thing two so as fast as i could i went after my net and i said with my net i can get them i bet i bet with my net i can get those things yet then i let down my net it came down with a plop and i had them at last thoe two things had to stop then i said to the cat now you do as i say you pack up those things and you take them away oh dear said the cat you did not like our game oh dear what a shame what a shame what a shame then he shut up the things in the box with the hook and the cat went away with a sad kind of look that is good said the fish he has gone away yes but your mother will come she will find 
this big mess and this mess is so big and so deep and so tall we ca not pick it up there is no way at all and then who was back in the house why the cat have no fear of this mess said the cat in the hat i always pick up all my playthings and so i will show you another good trick that i know then we saw him pick up all the things that were down he picked up the cake and the rake and the gown and the milk and the strings and the books and the dish and the fan and the cup and the ship and the fish and he put them away then he said that is that and then he was gone with a tip of his hat then our mother came in and she said to us two did you have any fun tell me what did you do and sally and i did not know what to say should we tell her the things that went on there that day should we tell her about it now what should we do well what would you do if your mother asked you"

To apply this to all four texts, simply create a function wrapper for the processing commands and then use dplyr to pass the text through the processing function.

clean_fn <- function(.text){
  str_remove_all(.text, '[[:punct:]]') %>% tolower()
}

seuss_text_clean <- seuss_text %>%
  mutate(text = clean_fn(text))

You could also create a manual list of punctuation to remove.

Action

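The regular expression for matching a or b is a|b (spaces inside a pattern are matched literally, so leave them out). Write an alternative to the previous code chunk that lists the punctuation to remove explicitly and does not use '[[:punct:]]'. Note that some punctuation characters, such as . and ?, have a special meaning in regular expressions and need to be escaped.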

Basic NLP

As you saw in class, once we have a string of clean text for each document, tokenization and lemmatization are largely automated.

Tokenization

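unnest_tokens() will tokenize and return the result in tidy format; lemmatize_words() can then be applied to the resulting column of tokens using dplyr commands. In the chunk below, punctuation is stripped from the stop word list first so that contractions such as ‘don't’ match their cleaned form ‘dont’ in the text.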
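Before running the full pipeline, it can help to see word tokenization with stop word removal on a single string. The sketch below calls tokenizers::tokenize_words() directly; the input string and the two-word stop list are made up, and it assumes that unnest_tokens() hands its stopwords argument on to this tokenizer when token = 'words'.

```r
library(tokenizers)

# made-up input and a tiny, hypothetical stop word list
toy_text <- "The Cat in the Hat!"
toy_stopwords <- c("the", "in")

# tokenize_words() lowercases and strips punctuation by default,
# then drops any token found in the stop word list
toks <- tokenize_words(toy_text, stopwords = toy_stopwords)

toks[[1]]
```

Only the tokens "cat" and "hat" remain. The result is a list with one character vector per input string, which unnest_tokens() flattens into one row per token.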

stpwrd <- stop_words %>%
  pull(word) %>%
  str_remove_all('[[:punct:]]')

seuss_tokens_long <- seuss_text_clean %>%
  unnest_tokens(output = token, # specifies new column name
                input = text, # specifies column containing text
                token = 'words', # how to tokenize
                stopwords = stpwrd) %>% # optional stopword removal
  mutate(token = lemmatize_words(token))

Action

Based on the data frame above, use row counting (count()) to answer the following questions:

  1. What’s the most frequently used word in each book?
  2. What’s the most frequently used word in all books?

Compare with your neighbor to check your answers.

If there’s time: refer to the documentation ?unnest_tokens to determine how to tokenize as bigrams. Find the most frequent bigrams in each book.

Frequency measures

The frequency measures discussed in class – term frequency (TF), inverse document frequency (IDF), and their product (TF-IDF) – can be computed from token counts using tidytext::bind_tf_idf().

seuss_tfidf <- seuss_tokens_long %>%
  count(doc, token) %>%
  bind_tf_idf(term = token,
              document = doc,
              n = n) 

seuss_df <- seuss_tfidf %>%
  pivot_wider(id_cols = doc, 
              names_from = token,
              values_from = tf_idf,
              values_fill = 0)

seuss_df
# A tibble: 4 × 246
  doc         bad    ball     bed    bend     bet    bite   book     bow     box
  <fct>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
1 "the c… 0.00396 0.00990 0.00198 0.00198 0.00792 0.00792 0.0158 0.00396 0.00411
2 "fox i… 0       0       0       0.00891 0       0       0      0       0.00431
3 "green… 0       0       0       0       0       0       0      0       0.0139 
4 "hop o… 0.00788 0.0118  0.0118  0       0       0.00394 0      0       0      
# … with 236 more variables: bump <dbl>, ca <dbl>, cake <dbl>, call <dbl>,
#   cat <dbl>, cold <dbl>, cup <dbl>, day <dbl>, dear <dbl>, deep <dbl>,
#   dish <dbl>, dot <dbl>, fall <dbl>, fan <dbl>, fast <dbl>, fear <dbl>,
#   fish <dbl>, fly <dbl>, fox <dbl>, fun <dbl>, funinabox <dbl>, funny <dbl>,
#   game <dbl>, gown <dbl>, hall <dbl>, hand <dbl>, hat <dbl>, head <dbl>,
#   hear <dbl>, hit <dbl>, hold <dbl>, home <dbl>, hook <dbl>, hop <dbl>,
#   house <dbl>, jump <dbl>, kick <dbl>, kind <dbl>, kite <dbl>, light <dbl>, …

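We can use these data to compute a variety of summaries of the text. For example, the two words with the highest TF-IDF in each book, that is, the words that most distinguish it from the other books, are shown below. Because slice_max() keeps ties, ‘green eggs and ham’ contributes four rows rather than two.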

seuss_tfidf %>%
  group_by(doc) %>%
  slice_max(tf_idf, n = 2)
# A tibble: 10 × 6
# Groups:   doc [4]
   doc                  token      n     tf   idf tf_idf
   <fct>                <chr>  <int>  <dbl> <dbl>  <dbl>
 1 "the cat in the hat" cat       26 0.0743 0.693 0.0515
 2 "the cat in the hat" fish      20 0.0571 0.693 0.0396
 3 "fox in socks "      sir       37 0.0792 1.39  0.110 
 4 "fox in socks "      sock      19 0.0407 1.39  0.0564
 5 "green eggs and ham" samiam    13 0.0897 1.39  0.124 
 6 "green eggs and ham" egg       10 0.0690 1.39  0.0956
 7 "green eggs and ham" green     10 0.0690 1.39  0.0956
 8 "green eggs and ham" ham       10 0.0690 1.39  0.0956
 9 "hop on pop"         brown     10 0.0568 1.39  0.0788
10 "hop on pop"         pup        8 0.0455 1.39  0.0630
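The columns in this table can be checked by hand. The sketch below reproduces the ‘cat’ row in base R, assuming the usual definitions that bind_tf_idf() follows: TF is the term count divided by the number of tokens in the document, and IDF is the natural log of the number of documents divided by the number of documents containing the term. The document length of 350 tokens is not printed anywhere; it is inferred here from the displayed values (26 / 0.0743).

```r
# idf = ln(number of documents / number of documents containing the term)
n_docs <- 4
idf_two_books <- log(n_docs / 2)  # term appears in 2 of the 4 books
idf_one_book  <- log(n_docs / 1)  # term appears in only 1 book

# tf = count of the term / total tokens in the document
# 350 is inferred from the printed values, not read from the data
n_cat <- 26
n_tokens_cat_in_hat <- 350
tf_cat <- n_cat / n_tokens_cat_in_hat

# tf-idf is the product of the two
tfidf_cat <- tf_cat * idf_two_books

round(c(idf_two_books, idf_one_book), 3)      # 0.693 1.386
round(c(tf = tf_cat, tf_idf = tfidf_cat), 4)  # 0.0743 0.0515
```

So an IDF of 0.693 means the word appears in two of the four books, and 1.39 means it appears in only one.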

By contrast, the two most frequent words in each book (highest TF) are:

seuss_tfidf %>%
  group_by(doc) %>%
  slice_max(tf, n = 2)
# A tibble: 8 × 6
# Groups:   doc [4]
  doc                  token      n     tf   idf tf_idf
  <fct>                <chr>  <int>  <dbl> <dbl>  <dbl>
1 "the cat in the hat" cat       26 0.0743 0.693 0.0515
2 "the cat in the hat" fish      20 0.0571 0.693 0.0396
3 "fox in socks "      sir       37 0.0792 1.39  0.110 
4 "fox in socks "      sock      19 0.0407 1.39  0.0564
5 "green eggs and ham" eat       24 0.166  0.288 0.0476
6 "green eggs and ham" samiam    13 0.0897 1.39  0.124 
7 "hop on pop"         brown     10 0.0568 1.39  0.0788
8 "hop on pop"         pat       10 0.0568 0.693 0.0394

Action

Discuss with your neighbor how you might determine how ‘different’ any two books are using an appropriate frequency measure and comparison between rows.

  1. Compute your difference measure for all pairs of books. Which pair is most distinct?
  2. Use the same idea to compute difference from the ‘average’ Dr. Seuss book. Which book is most different from the rest?