Efficiently checking value of other row in data.table

Tags: r, data.table

Note: This is a question that I originally posted to the data.table help group. Matt Dowle asked for a more detailed example and I posted this one, but I had trouble with the formatting in e-mail. I already know how to format on SO, so I thought I would post it here instead.

What I am basically trying to do is subset rows from a data.table based on a value in that row as well as a value in a preceding or following row. Right now, I am creating new columns for future and past rows and then keying the data.table on these columns, but this is resource-intensive and onerous.

The example below illustrates the approach I am using now. It uses words in documents (numeric indices for both). I want to subset for a particular word, but only if it is preceded or followed by another word or set of words:

First, I create a dummy dataset of one million words spread across ten documents, with a vocabulary of three unique words.

library(data.table)
set.seed(1000)

# One million (word, document) observations: 3 unique words, 10 documents
DT <- data.table(wordindex = sample(1:3, 1000000, replace = TRUE),
                 docindex  = sample(1:10, 1000000, replace = TRUE))
setkey(DT, docindex)

# Position of each word within its document
DT[, position := seq_len(.N), by = docindex]


          wordindex docindex position
      1:         1        1        1
      2:         1        1        2
      3:         3        1        3
      4:         3        1        4
      5:         1        1        5
    ---                            
 999996:         2       10    99811
 999997:         2       10    99812
 999998:         3       10    99813
 999999:         1       10    99814
1000000:         3       10    99815

Note that simply counting the occurrences of the first unique word across all documents is easy and beautiful.

setkey(DT, wordindex)

# Binary search on the key for word 1, then count occurrences per document
count <- DT[J(1), list(count.1 = .N), by = docindex]
count

    docindex count.1
 1:        1   33533
 2:        2   33067
 3:        3   33538
 4:        4   33053
 5:        5   33231
 6:        6   33002
 7:        7   33369
 8:        8   33353
 9:        9   33485
10:       10   33225

It gets messier when the following position has to be taken into account. This query counts the occurrences of the first unique word across all documents, except where it is followed by the second unique word. First I create a new column containing the word one position ahead, and then key on both words.

setkey(DT, docindex, position)

# Self-join on (docindex, position + 1) to pull in the next word
DT[, lead_wordindex := DT[list(docindex, position + 1)][, wordindex]]

         wordindex docindex position lead_wordindex
      1:         1        1        1              1
      2:         1        1        2              3
      3:         3        1        3              3
      4:         3        1        4              1
      5:         1        1        5              2
     ---                                           
 999996:         2       10    99811              2
 999997:         2       10    99812              3
 999998:         3       10    99813              1
 999999:         1       10    99814              3
1000000:         3       10    99815             NA

setkey(DT, wordindex, lead_wordindex)

# Word 1 followed by word 1 or word 3 (i.e., not followed by word 2)
countr2 <- DT[J(c(1, 1), c(1, 3)), list(count.1 = .N), by = docindex]
countr2

    docindex count.1
 1:        1   22301
 2:        2   21835
 3:        3   22490
 4:        4   21830
 5:        5   22218
 6:        6   21914
 7:        7   22370
 8:        8   22265
 9:        9   22211
10:       10   22190

I have a very large dataset on which the above query fails with a memory-allocation error. As an alternative, we can create the new column for only the relevant subset of data, by filtering the original dataset and then joining it back at the desired position:

setkey(DT, wordindex)

# Keep only the rows for word 1, then compute the position to look up
filter <- DT[J(1), list(wordindex, docindex, position)]
filter[, lead_position := position + 1]

        wordindex wordindex docindex position lead_position
     1:         1         1        2    99717         99718
     2:         1         1        3    99807         99808
     3:         1         1        4   100243        100244
     4:         1         1        1        1             2
     5:         1         1        1       42            43
    ---                                                    
332852:         1         1       10    99785         99786
332853:         1         1       10    99787         99788
332854:         1         1       10    99798         99799
332855:         1         1       10    99804         99805
332856:         1         1       10    99814         99815

setkey(DT, docindex, position)

# Look up the word at (docindex, lead_position) in the full table
filter[, lead_wordindex := DT[J(filter[, list(docindex, lead_position)])][, wordindex]]

        wordindex wordindex docindex position lead_position lead_wordindex
     1:         1         1        2    99717         99718             NA
     2:         1         1        3    99807         99808             NA
     3:         1         1        4   100243        100244             NA
     4:         1         1        1        1             2              1
     5:         1         1        1       42            43              1
    ---                                                                   
332852:         1         1       10    99785         99786              3
332853:         1         1       10    99787         99788              3
332854:         1         1       10    99798         99799              3
332855:         1         1       10    99804         99805              3
332856:         1         1       10    99814         99815              3

setkey(filter, wordindex, lead_wordindex)
countr2.1 <- filter[J(c(1, 1), c(1, 3)), list(count.1 = .N), by = docindex]
countr2.1

    docindex count.1
 1:        1   22301
 2:        2   21835
 3:        3   22490
 4:        4   21830
 5:        5   22218
 6:        6   21914
 7:        7   22370
 8:        8   22265
 9:        9   22211
10:       10   22190

Pretty ugly, I think. In addition, I may want to look more than one word ahead, which requires creating yet another column. The easy but costly way is:

setkey(DT, docindex, position)

# Same self-join trick, two positions ahead this time
DT[, lead_lead_wordindex := DT[list(docindex, position + 2)][, wordindex]]

         wordindex docindex position lead_wordindex lead_lead_wordindex
      1:         1        1        1              1                   3
      2:         1        1        2              3                   3
      3:         3        1        3              3                   1
      4:         3        1        4              1                   2
      5:         1        1        5              2                   3
     ---                                                               
 999996:         2       10    99811              2                   3
 999997:         2       10    99812              3                   1
 999998:         3       10    99813              1                   3
 999999:         1       10    99814              3                  NA
1000000:         3       10    99815             NA                  NA

setkey(DT, wordindex, lead_wordindex, lead_lead_wordindex)

# Word 1 followed by word 2 and then word 3
countr23 <- DT[J(1, 2, 3), list(count.1 = .N), by = docindex]
countr23

    docindex count.1
 1:        1    3684
 2:        2    3746
 3:        3    3717
 4:        4    3727
 5:        5    3700
 6:        6    3779
 7:        7    3702
 8:        8    3756
 9:        9    3702
10:       10    3744

However, I currently have to use the ugly filter-and-join way because of size.

So the question is: is there an easier and more beautiful way?

UPDATE:

Thanks to Arun and eddi for clean and simple code that solves the problem. On my data of roughly 200 million rows, this solution handles a simple combination of words in about 10 seconds, which is quite good!

I do have an added issue, however, that makes the vector-scan approach less than optimal. Although in the example I am looking for only one word combination, in practice I may have a vector of words to look for at each position. When I change the "==" comparisons to "%in%" for that purpose (vectors of 100 words or more), the query takes much longer to run. So I would still be interested in a binary-search solution if one exists. However, if Arun doesn't know of one, it may well not exist, and I would happily accept his answer.
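
For reference, the %in% variant looks roughly like this (a sketch using the lead() helper from Arun's answer below; the word sets here are hypothetical):

words.here <- 1:100        # hypothetical candidate words at position i
words.next <- c(1L, 3L)    # hypothetical candidate words at position i + 1

# %in% returns FALSE (never NA) for the NA lead at the end of each document,
# so those positions are simply excluded
DT[, sum(wordindex %in% words.here & lead(wordindex, 1L) %in% words.next),
   by = docindex]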

asked Jul 03 '14 by Matt


2 Answers

Just create a lead function and use it in your j-expression:

# Shift x forward by n positions: drop the first n values and pad the
# end with NAs, so the result has the same length as x
lead <- function(x, n)
    if (n == 0) x else c(tail(x, -n), rep.int(NA, n))
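
(As an aside: current versions of data.table ship a built-in shift(), so shift(wordindex, 1L, type = "lead") should behave the same; this answer predates it.)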

If you'd like to get the count of rows where wordindex at position i is 1L and at position i+1 is not 2L, then:

DT[, sum(wordindex == 1L & lead(wordindex, 1L) != 2L, na.rm=TRUE), by=docindex]
#     docindex    V1
#  1:        1 22301
#  2:        2 21835
#  3:        3 22490
#  4:        4 21830
#  5:        5 22218
#  6:        6 21914
#  7:        7 22370
#  8:        8 22265
#  9:        9 22211
# 10:       10 22190

If you'd like to get the counts where wordindex at position i is 1L, at i+1 is 2L, and at i+2 is 3L, then:

DT[, sum(wordindex == 1L & lead(wordindex, 1L) == 2L & 
          lead(wordindex, 2L) == 3L, na.rm=TRUE), by=docindex]
#     docindex   V1
#  1:        1 3684
#  2:        2 3746
#  3:        3 3717
#  4:        4 3727
#  5:        5 3700
#  6:        6 3779
#  7:        7 3702
#  8:        8 3756
#  9:        9 3702
# 10:       10 3744

Note that there's no need to setkey here; ad hoc by should work great.


To address the comment:

This solution uses a vector scan in j, as opposed to your binary-search approach, and there are trade-offs between the two. Compared to the binary-search version, this code is relatively elegant, easy to read, and easy to maintain and extend to multiple lags and conditions (I can't think of a binary-search formulation that avoids creating extra columns). It also requires much less memory, which is a limiting factor in your case. A sketch of such an extension follows.
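
For instance, a hypothetical helper (my sketch, not part of the original answer) that counts positions whose word at lag k falls into the k-th candidate set; it assumes rows are ordered by position within each document:

# 'pattern' is a list of candidate word sets, one per lag: counts positions i
# (per document) where the word at i+k is in pattern[[k+1]] for every
# k = 0, ..., length(pattern) - 1
count_pattern <- function(DT, pattern) {
    DT[, {
        ok <- rep(TRUE, .N)
        for (k in seq_along(pattern))
            ok <- ok & lead(wordindex, k - 1L) %in% pattern[[k]]
        sum(ok)
    }, by = docindex]
}

# Word 1 followed by word 2 and then word 3, as above:
count_pattern(DT, list(1L, 2L, 3L))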

You say "large data", but don't say anything more. Vector scans on entire data (say 20 million rows or 200 million rows) are costly, yes. But, operating on each group, even if it wouldn't give the performance of binary search, shouldn't be much slower. Of course this again depends on the number of groups you've and the number of observations per group. But it's better to benchmark these things and figure out.

I'll leave you to it. Good luck :).

answered by Arun


It sounds like you simply want:

# The appended 2 is a sentinel: the last word of each document is treated
# as if it were followed by a 2, so it is never counted
DT[, sum(wordindex == 1 & c(tail(wordindex, -1), 2) != 2), by = docindex]

I don't see the point of complicating it via joins.

Btw, in some cases this will give a different answer from yours, which is either because I don't understand what you want or because your method fails on some edge cases. E.g. try both methods for:

DT = data.table(wordindex = c(1,1,2,1,1,2), docindex = c(1,1,2,2,3,3))
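
Running the one-liner on this edge case looks like the sketch below (renaming to DT2 here to avoid overwriting the original DT; per the sentinel logic above, a word 1 at the last position of a document is never counted):

DT2 <- data.table(wordindex = c(1, 1, 2, 1, 1, 2),
                  docindex  = c(1, 1, 2, 2, 3, 3))
DT2[, sum(wordindex == 1 & c(tail(wordindex, -1), 2) != 2), by = docindex]
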
answered by eddi