R tm package: utf-8 text

Tags: r, utf-8, tm

I would like to create a wordcloud for non-English text in UTF-8 (specifically, text in the Kazakh language).

The text is displayed correctly by the inspect function of the tm package. However, when I look up word frequencies, the terms are displayed incorrectly:

The problem is that some characters are displayed as escape codes instead of letters, while ordinary Cyrillic characters are displayed correctly. Consequently, the wordcloud becomes a complete mess.

Is it possible to set the encoding for the tm functions somehow? I tried this, but the text on its own is fine; the problem only appears when using the tm package.

Let a sample text be:

Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді.

My code is simple (based on the onertipaday.blogspot.com tutorials):

library(tm)
library(wordcloud)
# Read the file and mark the strings as UTF-8
text <- readLines("text.txt", encoding = "UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
# Term frequencies, most frequent first
ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
table(ap.d$freq)

1  2 
44  4 

findFreqTerms(ap.tdm, lowfreq=2)

[1] "<U+04D9>лем"            "арман"                  "еді"                   
[4] "м<U+04D9><U+04A3>гілік"

Those words should be: "Әлем", "арман", "еді", "мәңгілік". They are displayed correctly in the inspect(ap.corpus) output.

Highly appreciate any help! :)


asked Jan 21 '14 07:01

Asayat

1 Answer

The problem comes from the default tokenizer. By default, tm uses scan_tokenizer, which loses the encoding (maybe you should contact the maintainer and ask for an encoding argument to be added).

scan_tokenizer
function (x)
{
    scan(text = x, what = "character", quote = "", quiet = TRUE)
}

One solution is to provide your own tokenizer when building the term-document matrix. I am using strsplit:

scanner <- function(x) strsplit(x, " ")
ap.tdm <- TermDocumentMatrix(ap.corpus, control = list(tokenize = scanner))

The result is then correctly encoded:

findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман"    "біз"      "еді"      "әлем"     "идеясы"   "мәңгілік"
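
As a variation on the strsplit idea, here is a minimal sketch of a tokenizer that splits on any run of whitespace and drops empty tokens. The name utf8_tokenizer is hypothetical and not part of tm, and the snippet only exercises base R; whether it behaves the same inside TermDocumentMatrix may depend on the tm version and platform.

```r
# Hypothetical helper (not part of tm): split on runs of whitespace.
utf8_tokenizer <- function(x) {
  x <- enc2utf8(as.character(x))                  # ensure the input is UTF-8
  tokens <- unlist(strsplit(x, "[[:space:]]+"))   # split on spaces, tabs, newlines
  tokens[nzchar(tokens)]                          # drop empty strings
}

# "\u04d9\u043b\u0435\u043c" is the word "әлем" and "\u0435\u0434\u0456" is "еді";
# the escapes keep this script independent of the encoding of the file itself.
sample_text <- "\u04d9\u043b\u0435\u043c  \u0435\u0434\u0456 \u04d9\u043b\u0435\u043c"

tokens <- utf8_tokenizer(sample_text)
freq <- sort(table(tokens), decreasing = TRUE)    # token counts, most frequent first
print(Encoding(tokens))
print(freq)
```

If this works on your setup, it could presumably be passed in the same way as scanner above, i.e. control = list(tokenize = utf8_tokenizer).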

answered Oct 01 '22 06:10

agstudy
