I am using the tidytext package in R to do n-gram analysis. Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts the text to lower case.
I found that unnest_tokens has an option to use a regular expression with token='regex', so I can customize the way it cleans the text. But that only works for unigram analysis; it doesn't help with n-grams, because I need to set token='ngrams' to do n-gram analysis.
Is there any way to prevent unnest_tokens from converting text to lowercase in n-gram analysis?
Arguments for tokenize_words are available within the unnest_tokens function call, so you can use strip_punct = FALSE directly as an argument to unnest_tokens.
Example:
txt <- data.frame(text = "Arguments for `tokenize_words` are available within the `unnest_tokens` function call. So you can use `strip_punct = FALSE` directly as an argument for `unnest_tokens`.", stringsAsFactors = FALSE)
unnest_tokens(txt, palabras, "text", strip_punct = FALSE)
palabras
1 arguments
1.1 for
1.2 `
1.3 tokenize_words
1.4 `
1.5 are
1.6 available
1.7 within
1.8 the
1.9 `
1.10 unnest_tokens
1.11 `
1.12 function
1.13 call
1.14 .
1.15 so
# And some more, but you get the point.
Also available: lowercase = FALSE and strip_numeric = TRUE, which flip the tokenize_words defaults (it lowercases and keeps numbers by default).
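As for the n-gram part of the question: unnest_tokens() also has its own documented to_lower argument, and it works with token = "ngrams" too. A minimal sketch (the tweet text here is made up for illustration):
library(tidytext)

tweets <- data.frame(text = "RT @user Check out #rstats today", stringsAsFactors = FALSE)

# to_lower = FALSE keeps the original case in the bigrams
unnest_tokens(tweets, bigram, "text", token = "ngrams", n = 2, to_lower = FALSE)
Note that the underlying tokenize_ngrams has no strip_punct option, so @ and # are still dropped here; keeping them inside n-grams would mean tokenizing with a custom regex first and building the n-grams yourself.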