 

Preserve punctuation using unnest_tokens() in tidytext in R

I am using the tidytext package in R to do n-gram analysis.

Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts the text to lowercase.

I found that unnest_tokens has an option to use a regular expression via token = "regex", so I can customize the way it cleans the text. But that only works for unigram analysis; it doesn't work for n-grams, because I need to set token = "ngrams" to do n-gram analysis.
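For example, a minimal sketch of what works for unigrams (the sample tweet is made up; splitting on whitespace keeps @ and # attached to their words, and to_lower = FALSE keeps the original case):

library(tidytext)

tweets <- data.frame(text = "RT @user: loving #rstats", stringsAsFactors = FALSE)

# token = "regex" splits on the supplied pattern instead of the default
# word tokenizer, so punctuation inside tokens survives
unnest_tokens(tweets, word, text, token = "regex", pattern = "\\s+", to_lower = FALSE)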

Is there any way to prevent unnest_tokens from converting text into lowercase in n-gram analysis?

asked Jun 12 '17 by JungHwan Yang

1 Answer

Arguments for tokenize_words are available within the unnest_tokens function call, so you can pass strip_punct = FALSE directly as an argument to unnest_tokens.

Example:

library(tidytext)

txt <- data.frame(text = "Arguments for `tokenize_words` are available within the `unnest_tokens` function call. So you can use `strip_punct = FALSE` directly as an argument for `unnest_tokens`.", stringsAsFactors = FALSE)
unnest_tokens(txt, palabras, "text", strip_punct = FALSE)

 palabras
 1         arguments
 1.1             for
 1.2               `
 1.3  tokenize_words
 1.4               `
 1.5             are
 1.6       available
 1.7          within
 1.8             the
 1.9               `
 1.10  unnest_tokens
 1.11              `
 1.12       function
 1.13           call
 1.14              .
 1.15             so
 #And some more, but you get the point. 

Other tokenize_words arguments work the same way, e.g. lowercase = FALSE and strip_numeric = TRUE to reverse the respective defaults.
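For the n-gram part of the question, a hedged sketch: unnest_tokens has its own to_lower argument, so passing to_lower = FALSE with token = "ngrams" should keep the original case (the sample sentence is made up; note that the n-gram tokenizer may still strip punctuation such as @ and #, so this only settles the lowercasing half of the question):

library(tidytext)

txt <- data.frame(text = "Tidytext Makes Ngram Analysis Easy", stringsAsFactors = FALSE)

# to_lower = FALSE tells unnest_tokens to keep the original case;
# it applies to any token type, including "ngrams"
unnest_tokens(txt, bigram, text, token = "ngrams", n = 2, to_lower = FALSE)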

answered Nov 09 '22 by mpaladino