Note: This is a question that I originally posted to the data.table help group. Matt Dowle asked for a more detailed example and I posted this one, but I had trouble with the formatting in e-mail. I already know how to format on SO, so I thought I would post it here instead.
What I am basically trying to do is subset rows from a data.table based on a value in that row as well as a value in a preceding or following row. Right now, I am creating new columns for future and past rows and then keying the data.table on these columns, but this is resource-intensive and onerous.
The example below illustrates the approach I am using now. The example uses words in documents (I use numeric indices for both). I want to subset for a particular word, but only if it is preceded or followed by another word or set of words:
I first create a dummy dataset with ten documents containing one million words. There are three unique words in the set.
library(data.table)
set.seed(1000)
DT<-data.table(wordindex=sample(1:3,1000000,replace=T),docindex=sample(1:10,1000000,replace=T))
setkey(DT,docindex)
DT[,position:=seq_len(.N),by=docindex]
wordindex docindex position
1: 1 1 1
2: 1 1 2
3: 3 1 3
4: 3 1 4
5: 1 1 5
---
999996: 2 10 99811
999997: 2 10 99812
999998: 3 10 99813
999999: 1 10 99814
1000000: 3 10 99815
Note that simply counting the occurrences of the first unique word across all documents is easy and beautiful.
setkey(DT,wordindex)
count<-DT[J(1),list(count.1=.N),by=docindex]
count
docindex count.1
1: 1 33533
2: 2 33067
3: 3 33538
4: 4 33053
5: 5 33231
6: 6 33002
7: 7 33369
8: 8 33353
9: 9 33485
10: 10 33225
It gets messier when taking the position ahead into account. This is a query to count the occurrences of the first unique word across all documents unless it is followed by the second unique word. First I create a new column containing the word one position ahead and then key on both words.
setkey(DT,docindex,position)
DT[,lead_wordindex:=DT[list(docindex,position+1)][,wordindex]]
wordindex docindex position lead_wordindex
1: 1 1 1 1
2: 1 1 2 3
3: 3 1 3 3
4: 3 1 4 1
5: 1 1 5 2
---
999996: 2 10 99811 2
999997: 2 10 99812 3
999998: 3 10 99813 1
999999: 1 10 99814 3
1000000: 3 10 99815 NA
setkey(DT,wordindex,lead_wordindex)
countr2<-DT[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2
docindex count.1
1: 1 22301
2: 2 21835
3: 3 22490
4: 4 21830
5: 5 22218
6: 6 21914
7: 7 22370
8: 8 22265
9: 9 22211
10: 10 22190
I have a very large dataset for which the above query fails with a memory-allocation error. As an alternative, we can create this new column for only the relevant subset of data by filtering the original dataset and then joining it back on the desired position:
setkey(DT,wordindex)
filter<-DT[J(1),list(wordindex,docindex,position)]
filter[,lead_position:=position+1]
wordindex wordindex docindex position lead_position
1: 1 1 2 99717 99718
2: 1 1 3 99807 99808
3: 1 1 4 100243 100244
4: 1 1 1 1 2
5: 1 1 1 42 43
---
332852: 1 1 10 99785 99786
332853: 1 1 10 99787 99788
332854: 1 1 10 99798 99799
332855: 1 1 10 99804 99805
332856: 1 1 10 99814 99815
setkey(DT,docindex,position)
filter[,lead_wordindex:=DT[J(filter[,list(docindex,lead_position)])][,wordindex]]
wordindex wordindex docindex position lead_position lead_wordindex
1: 1 1 2 99717 99718 NA
2: 1 1 3 99807 99808 NA
3: 1 1 4 100243 100244 NA
4: 1 1 1 1 2 1
5: 1 1 1 42 43 1
---
332852: 1 1 10 99785 99786 3
332853: 1 1 10 99787 99788 3
332854: 1 1 10 99798 99799 3
332855: 1 1 10 99804 99805 3
332856: 1 1 10 99814 99815 3
setkey(filter,wordindex,lead_wordindex)
countr2.1<-filter[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2.1
docindex count.1
1: 1 22301
2: 2 21835
3: 3 22490
4: 4 21830
5: 5 22218
6: 6 21914
7: 7 22370
8: 8 22265
9: 9 22211
10: 10 22190
Pretty ugly, I think. In addition, I may want to look more than one word ahead, necessitating the creation of yet another column. The easy but costly way is:
setkey(DT,docindex,position)
DT[,lead_lead_wordindex:=DT[list(docindex,position+2)][,wordindex]]
wordindex docindex position lead_wordindex lead_lead_wordindex
1: 1 1 1 1 3
2: 1 1 2 3 3
3: 3 1 3 3 1
4: 3 1 4 1 2
5: 1 1 5 2 3
---
999996: 2 10 99811 2 3
999997: 2 10 99812 3 1
999998: 3 10 99813 1 3
999999: 1 10 99814 3 NA
1000000: 3 10 99815 NA NA
setkey(DT,wordindex,lead_wordindex,lead_lead_wordindex)
countr23<-DT[J(1,2,3),list(count.1=.N),by=docindex]
countr23
docindex count.1
1: 1 3684
2: 2 3746
3: 3 3717
4: 4 3727
5: 5 3700
6: 6 3779
7: 7 3702
8: 8 3756
9: 9 3702
10: 10 3744
However, I currently have to use the ugly filter-and-join way because of size.
So the question is, is there an easier and more beautiful way?
UPDATE:
Thanks to Arun and eddi for clean and simple code that solves the problem. On my ~200M row data, this solution works on a simple combination of words in about 10 seconds, which is quite good!
I do have an added issue, however, that makes the vector-scan approach less than optimal. Although in the example I am looking for only one word combination, in practice I may have a vector of words to look for in each position. When I change the "==" comparisons to "%in%" for that purpose (vectors of 100 words or more), the query takes much longer to run. So I would still be interested in a binary search solution if one exists. However, if Arun doesn't know of one, it probably doesn't exist, and I would happily accept his answer.
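For reference, the slower %in% variant I am describing looks roughly like this (words1 and words2 are hypothetical candidate-word vectors; in practice each would hold 100+ word indices):

```r
# lead() as defined in Arun's answer below
lead <- function(x, n)
  if (n == 0) x else c(tail(x, -n), rep.int(NA, n))

# Hypothetical candidate sets, stand-ins for the real 100+ word vectors
words1 <- c(1L, 3L)
words2 <- c(2L)

# Vector scan with %in% instead of ==; NA leads simply fail the %in% test
DT[, sum(wordindex %in% words1 & lead(wordindex, 1L) %in% words2), by = docindex]
```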
Just create a lead function and use it in your j-expression:
lead <- function(x, n)
if (n == 0) x else c(tail(x, -n), rep.int(NA, n))
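For the "preceded by" case mentioned in the question, a symmetric lag helper can be sketched along the same lines (my addition, not part of the original answer):

```r
# Shift values backward, padding the front with NA
lag <- function(x, n)
  if (n == 0) x else c(rep.int(NA, n), head(x, -n))

# e.g. count occurrences of word 1 preceded by word 3:
# DT[, sum(wordindex == 1L & lag(wordindex, 1L) == 3L, na.rm=TRUE), by=docindex]
```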
If you'd like to get the count where wordindex at the i-th position is 1L and the (i+1)-th is not 2L, then:
DT[, sum(wordindex == 1L & lead(wordindex, 1L) != 2L, na.rm=TRUE), by=docindex]
# docindex V1
# 1: 1 22301
# 2: 2 21835
# 3: 3 22490
# 4: 4 21830
# 5: 5 22218
# 6: 6 21914
# 7: 7 22370
# 8: 8 22265
# 9: 9 22211
# 10: 10 22190
If you'd like to get counts where wordindex at position i is 1L, i+1 is 2L and i+2 is 3L, then:
DT[, sum(wordindex == 1L & lead(wordindex, 1L) == 2L &
lead(wordindex, 2L) == 3L, na.rm=TRUE), by=docindex]
# docindex V1
# 1: 1 3684
# 2: 2 3746
# 3: 3 3717
# 4: 4 3727
# 5: 5 3700
# 6: 6 3779
# 7: 7 3702
# 8: 8 3756
# 9: 9 3702
# 10: 10 3744
Note that there's no need to setkey here; ad hoc by should work great.
To address the comment:

This solution uses a vector scan in j as opposed to your binary search approach. But there are different trade-offs here. The code is relatively elegant, easy to read, easy to extend to multiple lags and conditions, and easy to maintain compared to the binary search version (I can't think of a way to do that without creating extra columns). It also requires much less memory, which is a limiting factor in your case.

You say "large data", but don't say anything more. Vector scans on entire data (say 20 million or 200 million rows) are costly, yes. But operating on each group, even if it won't match the performance of binary search, shouldn't be much slower. Of course this again depends on the number of groups you have and the number of observations per group. It's better to benchmark these things and find out.
I'll leave you to it. Good luck :).
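A minimal benchmarking sketch using base R's system.time (with lead as defined above); repeat with your filter-and-join version on the same data to compare:

```r
# Time the per-group vector scan on the full table
system.time(
  DT[, sum(wordindex == 1L & lead(wordindex, 1L) != 2L, na.rm = TRUE),
     by = docindex]
)
```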
It sounds like you simply want:
DT[, sum(wordindex == 1 & c(tail(wordindex, -1), 2) != 2), by = docindex]
I don't see the point of complicating it via joins.
By the way, my method gives a different answer from yours in some cases, either because I don't understand what you want or because your method fails on some edge cases. E.g. try both methods on:
DT = data.table(wordindex = c(1,1,2,1,1,2), docindex = c(1,1,2,2,3,3))