
R tm package: utf-8 text

Tags: r, utf-8, tm

I would like to create a word cloud for non-English text in UTF-8 (it is actually in the Kazakh language).

The text is displayed correctly by the inspect function of the tm package. However, when I look up word frequencies, the terms are displayed incorrectly.

The problem is that some words are displayed with escaped code points (such as <U+04D9>) instead of the actual letters. Standard Cyrillic characters are displayed correctly. Consequently, the word cloud becomes a complete mess.

Is it possible to assign an encoding to the tm functions somehow? I tried this, but the text on its own is fine; the problem only appears when using the tm package.

Let a sample text be:

Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді.

My simple code is this (based on the onertipaday.blogspot.com tutorials):

require(tm)
require(wordcloud)
text<-readLines("text.txt", encoding="UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)

1  2 
44  4 

findFreqTerms(ap.tdm, lowfreq=2)

[1] "<U+04D9>лем"            "арман"                  "еді"                   
[4] "м<U+04D9><U+04A3>гілік"

Those words should be: "әлем", "арман", "еді", "мәңгілік". They are displayed correctly in the inspect(ap.corpus) output.

Highly appreciate any help! :)

asked Jan 21 '14 07:01 by Asayat



1 Answer

The problem comes from the default tokenizer. By default, tm uses scan_tokenizer, which loses the encoding (maybe you should contact the maintainer and ask for an encoding argument to be added).

scan_tokenizer
function (x)
{
    scan(text = x, what = "character", quote = "", quiet = TRUE)
}

One solution is to provide your own tokenizer when creating the term-document matrix. Here I am using strsplit:

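# Custom tokenizer: split on single spaces; unlist() returns a plain character vector
scanner <- function(x) unlist(strsplit(as.character(x), " "))
ap.tdm <- TermDocumentMatrix(ap.corpus, control = list(tokenize = scanner))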

Then the result is correctly encoded:

findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман"    "біз"      "еді"      "әлем"     "идеясы"   "мәңгілік"
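
To check that strsplit-based tokenizing keeps the Kazakh letters intact independently of tm, here is a minimal base R sketch. The helper name utf8_tokens is hypothetical (it is not part of tm), the sample sentence is taken from the question, and the sketch assumes the session runs in a UTF-8 locale:

```r
# One sentence from the question's sample text, explicitly marked as UTF-8.
txt <- enc2utf8("Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді.")

# Hypothetical helper (not a tm function): replace punctuation with spaces,
# lower-case the text, then split on runs of whitespace.
utf8_tokens <- function(x) {
  x <- gsub("[[:punct:]]+", " ", x)
  x <- tolower(x)
  tokens <- unlist(strsplit(x, "[[:space:]]+"))
  tokens[nzchar(tokens)]  # drop empty strings left over from the split
}

tokens <- utf8_tokens(txt)

# Term frequencies, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Terms occurring at least twice, similar in spirit to findFreqTerms(lowfreq = 2).
names(freq)[freq >= 2]
```

If the terms print with their real letters here but not in the tm output, that points at the tokenizer step rather than at the input file. Unlike the one-line scanner above, this helper also splits on tabs and repeated spaces.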
answered Oct 01 '22 06:10 by agstudy