I am rewriting some R scripts which analyse large volumes of data (~17 million rows), and I thought I would try to improve their memory efficiency by using the data.table package (which I am only just learning!).
One part of the code has been puzzling me for a bit. I can't post my original solution because (1) it is crap (slow!), and (2) it's very nuanced with regards to the data, and would just complicate this question.
Instead, I have made this toy example (and it really is a toy example):
library(data.table)

ds <- data.table(ID  = c(1,1,1,1,2,2,2,3,3,3),
                 Obs = c(1.5,2.5,0.0,1.25,1.45,1.5,2.5,0.0,1.25,1.45),
                 Pos = c(1,3,5,6,2,3,5,2,3,4))
Which looks like this:
ID Obs Pos
1: 1 1.50 1
2: 1 2.50 3
3: 1 0.00 5
4: 1 1.25 6
5: 2 1.45 2
6: 2 1.50 3
7: 2 2.50 5
8: 3 0.00 2
9: 3 1.25 3
10: 3 1.45 4
For ease of explanation, I'll pretend that we are observing trains (each train having its own ID) moving along a linear one-way track, with observations (some value, not of import to the question) about the train being made at set positions (Pos, here from 1 to 6) along the track. It is not expected that a train will make it the entire length of the track (maybe it exploded before making it to Pos 6), and sometimes an observation is missed by the observer. The positions are consecutive, so if we missed observing the train at position 4 but observed it at position 5, we know it must have passed through position 4.
From the above data.table, I need to generate a table like this:
Pos Count
1: 1 3
2: 2 3
3: 3 3
4: 4 3
5: 5 2
6: 6 1
Where for each unique Pos in my data.table ds, I have a count of the number of trains that made it to that position (or further), regardless of whether an observation was actually made there.
If anyone has any ideas or suggestions as to how to tackle this, it would be much appreciated. Unfortunately, I'm not familiar enough with data.table to know whether this can be done! Or it could be an incredibly simple problem to solve and I'm just slow :)
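For concreteness, the naive base R logic I have in mind is roughly this (a sketch of the intent only, not my actual code, and this style of looping over positions is exactly what's too slow at 17 million rows):

pos    <- sort(unique(ds$Pos))
maxPos <- tapply(ds$Pos, ds$ID, max)   # furthest position reached by each train
data.frame(Pos   = pos,
           Count = sapply(pos, function(p) sum(maxPos >= p)))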
Great question!! The example data is especially well constructed and well explained.
First I'll show this answer, then I'll explain it step by step.
> ids = 1:3 # or from the data: unique(ds$ID)
> pos = 1:6 # or from the data: unique(ds$Pos)
> setkey(ds,ID,Pos)
> ds[CJ(ids,pos), roll=-Inf, nomatch=0][, .N, by=Pos]
Pos N
1: 1 3
2: 2 3
3: 3 3
4: 4 3
5: 5 2
6: 6 1
>
That should also be very efficient on your large data.
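If you want to sanity-check that on data of your size, here is a rough, self-contained timing sketch with made-up numbers (about 1.7 million trains with up to 10 positions each, ~17 million rows before some observations are dropped at random):

set.seed(1)
n   <- 1.7e6                                # made-up scale: ~1.7m trains
big <- data.table(ID  = rep(seq_len(n), each = 10),
                  Obs = runif(n * 10),
                  Pos = rep(1:10, times = n))
big <- big[runif(nrow(big)) > 0.3]          # drop ~30% of observations at random
setkey(big, ID, Pos)
system.time(
  big[CJ(unique(big$ID), 1:10), roll=-Inf, nomatch=0][, .N, by=Pos]
)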
Step by step
First I tried a cross join (CJ); i.e., every train joined to every position.
> ds[CJ(ids,pos)]
ID Pos Obs
1: 1 1 1.50
2: 1 2 NA
3: 1 3 2.50
4: 1 4 NA
5: 1 5 0.00
6: 1 6 1.25
7: 2 1 NA
8: 2 2 1.45
9: 2 3 1.50
10: 2 4 NA
11: 2 5 2.50
12: 2 6 NA
13: 3 1 NA
14: 3 2 0.00
15: 3 3 1.25
16: 3 4 1.45
17: 3 5 NA
18: 3 6 NA
I see 6 rows per train. I see 3 trains. I've got 18 rows as I expected. I see NA where that train wasn't observed. Good. Check. The cross join seems to be working.
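In case CJ itself is new to you: it just builds the sorted, keyed cross product of the vectors you give it. A minimal standalone example, separate from the question's data:

> CJ(1:2, c("a","b"))
   V1 V2
1:  1  a
2:  1  b
3:  2  a
4:  2  b

Let's now build the query up.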
You wrote that if a train is observed at position n, it must have passed the previous positions. Immediately I'm thinking roll. Let's try it.
> ds[CJ(ids,pos), roll=TRUE]
ID Pos Obs
1: 1 1 1.50
2: 1 2 1.50
3: 1 3 2.50
4: 1 4 2.50
5: 1 5 0.00
6: 1 6 1.25
7: 2 1 NA
8: 2 2 1.45
9: 2 3 1.50
10: 2 4 1.50
11: 2 5 2.50
12: 2 6 2.50
13: 3 1 NA
14: 3 2 0.00
15: 3 3 1.25
16: 3 4 1.45
17: 3 5 1.45
18: 3 6 1.45
Hm. That rolled the observations forwards for each train. It left some NA at position 1 for trains 2 and 3, but you said if a train is observed at position 2 it must have passed position 1. It also rolled the last observation for trains 2 and 3 forward to position 6, but you said trains might explode. So, we want to roll backwards! That's roll=-Inf. It's a complicated -Inf because you can also control how far to roll backwards, but we don't need that for this question; we just want to roll backwards indefinitely.
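As an aside (not needed here), that "how far" control is just a finite negative number in place of -Inf. A tiny standalone sketch with made-up values:

dt <- data.table(x = c(1, 4), v = c("a", "b"), key = "x")
dt[CJ(x = 1:4), roll = -Inf]   # v: a, b, b, b   (fills any distance backwards)
dt[CJ(x = 1:4), roll = -1]     # v: a, NA, b, b  (fills at most 1 step backwards)

Now let's run roll=-Inf on the trains and see what happens.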
> ds[CJ(ids,pos), roll=-Inf]
ID Pos Obs
1: 1 1 1.50
2: 1 2 2.50
3: 1 3 2.50
4: 1 4 0.00
5: 1 5 0.00
6: 1 6 1.25
7: 2 1 1.45
8: 2 2 1.45
9: 2 3 1.50
10: 2 4 2.50
11: 2 5 2.50
12: 2 6 NA
13: 3 1 0.00
14: 3 2 0.00
15: 3 3 1.25
16: 3 4 1.45
17: 3 5 NA
18: 3 6 NA
That's better. Almost there. All we need to do now is count. But those pesky NA rows are there after trains 2 and 3 exploded. Let's remove them.
> ds[CJ(ids,pos), roll=-Inf, nomatch=0]
ID Pos Obs
1: 1 1 1.50
2: 1 2 2.50
3: 1 3 2.50
4: 1 4 0.00
5: 1 5 0.00
6: 1 6 1.25
7: 2 1 1.45
8: 2 2 1.45
9: 2 3 1.50
10: 2 4 2.50
11: 2 5 2.50
12: 3 1 0.00
13: 3 2 0.00
14: 3 3 1.25
15: 3 4 1.45
Btw, data.table likes as much as possible to be inside one single DT[...], as that's how it optimizes the query. Internally, it doesn't create the NA and then remove them; it never creates the NA in the first place. This concept is important for efficiency.
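For contrast, a two-step version that gives the same 15 rows but materializes the NA rows before dropping them (it also leans on Obs never being genuinely NA in the real data):

ds[CJ(ids,pos), roll=-Inf][!is.na(Obs)]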
Finally, all we have to do is count. We can just tack this on the end as a compound query.
> ds[CJ(ids,pos), roll=-Inf, nomatch=0][, .N, by=Pos]
Pos N
1: 1 3
2: 2 3
3: 3 3
4: 4 3
5: 5 2
6: 6 1
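If you want the column literally called Count, as in your target table, just name it in j:

result <- ds[CJ(ids,pos), roll=-Inf, nomatch=0][, .(Count = .N), by=Pos]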
data.table sounds like an excellent solution. Given the way the data are ordered, one could find the maximum position reached by each train with
maxPos = ds$Pos[!duplicated(ds$ID, fromLast=TRUE)]
then tabulate how many trains end at each position
nAtMax = tabulate(maxPos)
and calculate the cumulative sum of trains at each position, counting from the end
rev(cumsum(rev(nAtMax)))
## [1] 3 3 3 3 2 1
I think this will be quite fast for large data, though not entirely memory efficient.
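Assembled into the questioner's target table, and leaning on ds being ordered by ID then Pos (e.g., after the setkey above) and on positions starting at 1:

maxPos <- ds$Pos[!duplicated(ds$ID, fromLast=TRUE)]   # last, i.e. furthest, Pos per train
nAtMax <- tabulate(maxPos)                            # number of trains ending at each position
data.frame(Pos = seq_along(nAtMax), Count = rev(cumsum(rev(nAtMax))))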