Removing non-English text from Corpus in R using tm()

Tags: r, tm

I am using tm and wordcloud for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).

Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special
satisfação
Happy
Sad
Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),readerControl = list(language = "lat"))

This yields the warning message:

Warning message:
In readLines(y, encoding = x$Encoding) :
  incomplete final line found on '/temp/file.txt'

But since it's a warning, not an error, I continue to push forward.

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

This then yields the error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'

I'm open to finding ways to filter out the non-English characters either in TextWrangler or R; whatever is the most expedient. Thanks for your help!

roody asked Aug 09 '13 18:08


2 Answers

Here's a method to remove words with non-ASCII characters before making a corpus:

# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, e.g.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split = ", "))
# flag words with non-ASCII characters: iconv() replaces every byte it
# cannot convert with the marker string "dat2", which grepl() then finds
dat3 <- grepl("dat2", iconv(dat2, "latin1", "ASCII", sub = "dat2"))
# subset original vector of words to exclude words with non-ASCII chars
# (logical indexing stays correct even when no word is flagged)
dat4 <- dat2[!dat3]
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
library(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
Special, Happy, Sad, Potential
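
The same filtering idea can also be applied line by line to text read from a file, without a marker string. The following is only a minimal sketch in base R: `is_ascii` and `drop_non_ascii_words` are hypothetical helper names, and the sample lines are the ones from the question (in practice they would come from `readLines()`).

```r
# Hypothetical helper: TRUE for strings made only of ASCII bytes (0-127).
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: drop every whitespace-separated word that
# contains a non-ASCII byte, one input line at a time.
drop_non_ascii_words <- function(lines) {
  vapply(strsplit(lines, "[[:space:]]+"), function(words) {
    paste(words[is_ascii(words)], collapse = " ")
  }, character(1))
}

# Sample lines from the question; \u escapes keep the script ASCII-only.
lines <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad",
           "Potential f\u00fcr")
cleaned <- drop_non_ascii_words(lines)
cleaned <- cleaned[nzchar(cleaned)]  # remove lines left empty
print(cleaned)
```

For the sample lines this keeps Special, Happy, Sad and Potential. The resulting vector could then be passed to `Corpus(VectorSource(cleaned))` as above.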

Ben answered Sep 20 '22 20:09


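You can also use the stringi package. Instead of removing the affected words, it transliterates accented characters to their closest ASCII equivalents.

Using the above example: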

library(stringi)
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
stringi::stri_trans_general(dat, "latin-ascii")

Output:

[1] "Special,  satisfacao, Happy, Sad, Potential, fur"  
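
If the goal is to carry on with lowercasing as in the question, the transliteration can be applied to each line before building the corpus. This is a rough sketch, assuming the stringi package is installed and that the lines have already been read in as a character vector; the sample data mirrors the question's file.

```r
library(stringi)

# Sample lines from the question; \u escapes keep the script ASCII-only.
lines <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad",
           "Potential f\u00fcr")

# Transliterate accented Latin letters to ASCII, then lowercase.
ascii_lines <- stri_trans_general(lines, "Latin-ASCII")
lower_lines <- stri_trans_tolower(ascii_lines)

# Anything still non-ASCII (e.g. text in other scripts) is dropped here.
lower_lines <- lower_lines[stri_enc_isascii(lower_lines)]
print(lower_lines)
```

Because every remaining string is plain ASCII, later steps such as `tolower` should no longer hit the invalid-input error from the question.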

Wilfredo answered Sep 20 '22 20:09