 

Maximum reasonable size for stemCompletion in tm?

Tags: r, tm

I have a corpus of 26 plain text files, each between 12 and 148 KB, 1.2 MB in total. I'm using R on a Windows 7 laptop.

I did all the normal cleanup (standard and custom stopwords, lowercasing, removing numbers) and now want to do stem completion. I am using the original corpus as a dictionary, as shown in the examples. I tried a couple of simple vectors (about 5 terms) to make sure it would work at all, and it did, very quickly.

exchanger <- function(x) stemCompletion(x, budget.orig)
budget <- tm_map(budget, exchanger)
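
For context, the cleanup I ran beforehand looked roughly like this (a sketch from memory; myStopwords stands in for my actual custom list, and budget.orig is the unstemmed copy I keep as the dictionary):

library(tm)
library(SnowballC)   # stemDocument uses the Snowball stemmer

budget <- tm_map(budget, content_transformer(tolower))
budget <- tm_map(budget, removeNumbers)
budget <- tm_map(budget, removeWords, stopwords("english"))
budget <- tm_map(budget, removeWords, myStopwords)
budget.orig <- budget                   # unstemmed copy kept for stem completion
budget <- tm_map(budget, stemDocument)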

It's been running since 4pm yesterday! In RStudio, under diagnostics, the request log shows new requests with different request numbers. Task Manager shows it using some memory, but not a crazy amount. I don't want to stop it, because what if it's almost done? Any other ideas on how to check progress? It's a volatile corpus, unfortunately. Any sense of how long it should take? I thought about using the DTM's terms vector as a dictionary, cut off at the most frequent terms (or those with high tf-idf), but I'm reluctant to kill this process.
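
For the record, this is roughly what I had in mind for the trimmed dictionary (an untested sketch; the cutoff of 5 is arbitrary):

dtm <- DocumentTermMatrix(budget.orig)
freq.dict <- findFreqTerms(dtm, lowfreq = 5)   # keep only terms appearing 5+ times
exchanger <- function(x) stemCompletion(x, freq.dict)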

This is a regular Windows 7 laptop with lots of other things running.

Is this corpus too big for stemCompletion? Short of moving to Python, is there a better way to do stemCompletion, or to lemmatize instead of stem? My web searching has not yielded any answers.

asked Jun 07 '13 by ChristinaP


1 Answer

I can't give you a definite answer without data that reproduces your problem, but I would guess the bottleneck is the following line from the stemCompletion source code:

possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))

After that, since you have kept the completion heuristic at its default of "prevalent", this happens:

possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
structure(names(sapply(possibleCompletions, "[", 1)), names = x)
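
To make the "prevalent" heuristic concrete, here is a toy illustration (my own example, not from the package):

# completing the stem "organiz" against a toy dictionary
dictionary <- c("organization", "organization", "organize")
matches <- grep("^organiz", dictionary, value = TRUE)
sort(table(matches), decreasing = TRUE)
# "organization" appears twice, so it is chosen as the completion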

That first line loops through each word in your corpus and checks it against your dictionary for possible completions. I'm guessing you have many words that appear many times in your corpus, which means the function is called over and over only to return the same result. A possibly faster version (depending on how many words are repeated, and how often) would look something like this:

y <- unique(x)   # look up each distinct stem only once
possibleCompletions <- lapply(y, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
z <- structure(names(sapply(possibleCompletions, "[", 1)), names = y)
z[match(x, names(z))]   # expand the unique results back to the full input

So it only loops through the unique values of x rather than through every value of x. To create this revised version of the code, you would need to download the tm source from CRAN and modify the function (I found it in completion.R in the R folder).
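
Alternatively, rather than patching the package, you could define the revised version as a standalone helper in your session and pass it to tm_map. A sketch along those lines (stemCompletion_mod is a hypothetical name; it assumes the dictionary is a plain character vector of words, flattened from the corpus first):

# flatten the dictionary corpus into a word vector (words() comes from tm/NLP)
dict.words <- unique(unlist(lapply(budget.orig, words)))

stemCompletion_mod <- function(x, dictionary) {
  y <- unique(x)   # only look up each distinct stem once
  possibleCompletions <- lapply(y, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
  possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
  z <- structure(names(sapply(possibleCompletions, "[", 1)), names = y)
  z[match(x, names(z))]   # map results back onto the full input
}

exchanger <- function(x) stemCompletion_mod(x, dict.words)
budget <- tm_map(budget, exchanger)

Since it takes the same inputs and returns a vector of the same shape, it slots into your existing tm_map call unchanged.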

Or you may just want to use Python for this one.

answered Nov 12 '22 by SchaunW