I have been trying to implement the algorithm recently proposed in this paper. Given a large amount of text (a corpus), the algorithm is supposed to return characteristic n-grams (i.e., sequences of n words) of the corpus. The user can choose the appropriate n, and at the moment I am trying n = 2-6 as in the original paper. In other words, using the algorithm, I want to extract the 2- to 6-grams that characterize the corpus.
I was able to implement the part that calculates the score on the basis of which characteristic n-grams are identified, but I have been struggling to eliminate the non-characteristic ones.
I have a list called token.df that contains five data frames covering all the n-grams that appear in the corpus, one data frame per value of n. For example, token.df[[2]] contains all the bigrams (2-grams) and their scores (called mi below) in alphabetical order.
> head(token.df[[2]])
w1 w2 mi
_ eos 17.219346
_ global 7.141789
_ what 8.590394
0 0 2.076421
0 00 5.732846
0 000 3.426785
Here, the bigram 0 0 (though the tokens are not quite words as such) has a score of 2.076421. Since the data frames include all the n-grams that appear in the corpus, they each have over one million rows.
> sapply(token.df, nrow)
[[1]]
NULL
[[2]]
[1] 1006059 # number of unique bigrams in the corpus
[[3]]
[1] 2684027 # number of unique trigrams in the corpus
[[4]]
[1] 3635026 # number of unique 4-grams in the corpus
[[5]]
[1] 3965120 # number of unique 5-grams in the corpus
[[6]]
[1] 4055048 # number of unique 6-grams in the corpus
I want to identify which n-grams to retain and which ones to discard. For this purpose, the algorithm does the following.
> token.df[[2]][15, ]
w1 w2 mi
0 001 10.56292
> token.df[[3]][33:38, ]
w1 w2 w3 mi
0 001 also 3.223091
0 001 although 5.288097
0 001 and 2.295903
0 001 but 4.331710
0 001 compared 6.270625
0 001 dog 11.002312
> token.df[[4]][46:48, ]
w1 w2 w3 w4 mi
0 001 compared to 5.527626
0 001 dog walkers 10.916028
0 001 environmental concern 10.371769
Here, the bigram 0 001 is not retained because one of the trigrams whose first two words match the bigram (0 001 dog) has a higher score than the bigram (11.002312 > 10.56292). The trigram 0 001 dog is retained because its score (11.002312) is higher than that of the bigram that matches the first two words of the trigram (0 001; score = 10.56292) and that of the 4-gram whose first three words match the trigram (0 001 dog walkers; score = 10.916028).
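In code, the check for a single bigram (row 15 above) amounts to the following:
bg   <- token.df[[2]][15, ]                       # the bigram "0 001"
kids <- token.df[[3]][token.df[[3]]$w1 == bg$w1 &
                      token.df[[3]]$w2 == bg$w2, ] # all trigrams sharing its first two words
all(bg$mi > kids$mi)                              # FALSE here: "0 001 dog" scores higher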
What I would like to know is an efficient way to achieve the above. To determine which bigrams to retain, for example, I need to find, for each row of token.df[[2]], which rows of token.df[[3]] have their first two words identical to the bigram in question. However, since the number of rows is large, my iteration approaches below take far too long to run. They focus on the case of bigrams because the task looked simpler there than for the 3- to 5-grams.
The for loop approach. Since the code below goes over all the rows of token.df[[3]] at each iteration, I estimated it would take months to run. The by() approach was only slightly better.
# for loop: for each bigram, collect the scores of all trigrams whose first
# two words match it, and retain the bigram only if it beats the best of them
retain <- logical(nrow(token.df[[2]]))
for (i in 1:nrow(token.df[[2]])) {
  mis <- token.df[[3]]$mi[token.df[[2]][i, ]$w1 == token.df[[3]][ , 1] &
                          token.df[[2]][i, ]$w2 == token.df[[3]][ , 2]]
  retain[i] <- token.df[[2]]$mi[i] > max(mis)
}

# by: the same comparison, with one group of trigram scores per bigram row
mis <- by(token.df[[2]], 1:nrow(token.df[[2]]),
          function(x) token.df[[3]]$mi[x$w1 == token.df[[3]]$w1 & x$w2 == token.df[[3]]$w2])
retain <- sapply(seq(mis), function(i) token.df[[2]]$mi[i] > max(mis[[i]]))
The pointer approach. The problem with the above is the large number of iterations over a (vertically) long data frame. To alleviate the issue, I thought I could exploit the fact that the n-grams are sorted alphabetically in each data frame and keep a kind of pointer indicating at which row to start looking. However, this approach, too, takes too long to run (at least several days).
retain <- logical(nrow(token.df[[2]]))
n.tri <- nrow(token.df[[3]])  # number of rows of the trigram data frame
pos <- 1                      # pointer into the trigram data frame
for (i in seq_len(nrow(token.df[[2]]))) {
  j <- 1
  target.rows <- numeric(10)
  # advance the pointer while the trigram at pos shares the bigram's first two words
  while (TRUE) {
    if (pos == n.tri + 1 || !all(token.df[[2]][i, 1:2] == token.df[[3]][pos, 1:2])) break
    target.rows[j] <- pos
    pos <- pos + 1
    if (j %% 10 == 0) target.rows <- c(target.rows, numeric(10))  # grow the buffer
    j <- j + 1
  }
  target.rows <- target.rows[target.rows != 0]
  retain[i] <- token.df[[2]]$mi[i] > max(token.df[[3]]$mi[target.rows])
}
Is there a way to do this task within a reasonable amount of time (e.g., overnight)? Since my iteration approaches have come to nothing, I am wondering whether any vectorization is possible, but I am open to any means of speeding up the process.
The data have a tree structure in that one bigram is divided into one or more trigrams, each of which in turn is divided into one or more 4-grams, and so forth. I am not sure how best to process this kind of data.
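To make the structure concrete, the first n words of each (n+1)-gram form a key that identifies its parent n-gram; for bigrams and trigrams, for example:
bigram.key     <- paste(token.df[[2]]$w1, token.df[[2]]$w2)  # one key per bigram row
trigram.prefix <- paste(token.df[[3]]$w1, token.df[[3]]$w2)  # parent key of each trigram
# table(trigram.prefix %in% bigram.key)  # every trigram prefix corresponds to some bigram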
I thought about posting part of the real data I am using, but cutting the data down would defeat the whole point of the question. I assume people do not want to download the whole 250 MB data set just for this, nor do I have the right to upload it. Below is a random data set that is smaller than the one I am using but reproduces the problem. With the pointer approach above, it takes my computer 4-5 seconds to process the first 100 rows of token.df[[2]] below, which extrapolates to roughly 12 hours for all the bigrams.
token.df <- list()
types <- combn(LETTERS, 4, paste, collapse = "")  # pool of pseudo-word types
set.seed(1)
data <- data.frame(matrix(sample(types, 6 * 1E6, replace = TRUE), ncol = 6),
                   stringsAsFactors = FALSE)
colnames(data) <- paste0("w", 1:6)
data <- data[order(data$w1, data$w2, data$w3, data$w4, data$w5, data$w6), ]
set.seed(1)
for (n in 2:6) token.df[[n]] <- cbind(data[ , 1:n], mi = runif(1E6))  # random scores
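This is roughly how the 4-5 second figure can be reproduced on the toy data: a condensed version of the pointer loop above, run on just the first 100 bigrams and wrapped in system.time().
system.time({
  retain100 <- logical(100)
  pos <- 1
  for (i in 1:100) {
    rows <- integer(0)
    # advance the pointer while the trigram at pos shares the bigram's first two words
    while (pos <= nrow(token.df[[3]]) &&
           all(token.df[[2]][i, 1:2] == token.df[[3]][pos, 1:2])) {
      rows <- c(rows, pos)
      pos <- pos + 1
    }
    retain100[i] <- length(rows) == 0 ||
      token.df[[2]]$mi[i] > max(token.df[[3]]$mi[rows])
  }
})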
Any ideas to speed up the code are highly appreciated.
The following runs in under 7 seconds on my machine, for all the bigrams:
library(dplyr)
# join the trigrams onto the bigrams; after the join, mi.x is the bigram score
# and mi.y the trigram score
res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))
res <- group_by(res, w1, w2)
bigrams <- filter(summarise(res, keep = all(mi.y < mi.x)), keep)
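One caveat about this sketch: inner_join drops any bigram that has no matching trigram at all, and by the rule described in the question such bigrams should be kept. If that case can occur in your data, an anti_join (my addition, not part of the timing above) picks them up:
orphans <- anti_join(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))  # bigrams with no matching trigram
These rows can then be combined with bigrams in whatever form suits the rest of your pipeline.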
There's nothing special about dplyr here. An equally fast (or faster) solution could surely be done with data.table or directly in SQL. You just need to switch to using joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if simply using merge and then aggregate in base R were orders of magnitude faster than what you're doing now. (But you really should be doing this with data.table, dplyr, or directly in a SQL database.)
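For what it's worth, a base-R sketch of that merge-plus-aggregate route could look roughly like this; I have not benchmarked it, so treat it as an outline rather than a recommendation:
m <- merge(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))  # mi.x = bigram score, mi.y = trigram score
m$keep <- m$mi.y < m$mi.x                                     # is each trigram weaker than its bigram?
agg <- aggregate(keep ~ w1 + w2, data = m, FUN = all)
bigrams_base <- agg[agg$keep, c('w1', 'w2')]                  # bigrams that beat all their trigrams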
Indeed, this:
library(data.table)
dt2 <- setkey(data.table(token.df[[2]]), w1, w2)
dt3 <- setkey(data.table(token.df[[3]]), w1, w2)
# in the join, mi is the trigram score and i.mi the bigram score
dt_tmp <- dt3[dt2, allow.cartesian = TRUE][, list(k = all(mi < i.mi)), by = c('w1', 'w2')][(k)]
is even faster still (~2x). I'm not even really sure that I've squeezed all the speed I could have out of either package, to be honest.
(Edit from Rick. Attempted as a comment, but the syntax was getting messed up.)
If you use data.table, this should be even faster, as data.table has a by-without-by feature (in current versions requested explicitly with by = .EACHI; see ?data.table for more info):
dt_tmp <- dt3[dt2, list(k = all(mi < i.mi)), by = .EACHI, allow.cartesian = TRUE][(k)]
Note that when joining data.tables you can prefix a column name with i. to indicate that the column should be taken from the data.table passed as the i= argument.
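To run the same check at every level of n (bigrams against trigrams, trigrams against 4-grams, and so on), the keyed join can be wrapped in a loop over n. The sketch below is only a rough, untested generalization of the approach above, and it expresses only the "no higher-scoring (n+1)-gram" half of the rule; the comparison against the (n-1)-gram parent described in the question would be added analogously.
library(data.table)
keep.list <- list()
for (n in 2:5) {
  key.cols <- paste0("w", 1:n)
  dt.n  <- setkeyv(data.table(token.df[[n]]), key.cols)      # the n-grams
  dt.n1 <- setkeyv(data.table(token.df[[n + 1]]), key.cols)  # the (n+1)-grams
  # join the (n+1)-grams onto the n-grams: mi is the (n+1)-gram score, i.mi the
  # n-gram score, and mi is NA for n-grams with no matching (n+1)-gram at all
  joined <- dt.n1[dt.n, allow.cartesian = TRUE]
  keep.list[[n]] <- joined[, list(k = all(is.na(mi) | mi < i.mi)), by = key.cols][(k)]
}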