I have a data.table that looks like this
ID, Order, Segment
1, 1, A
1, 2, B
1, 3, B
1, 4, C
1, 5, B
1, 6, B
1, 7, B
1, 8, B
Basically, after ordering the data by the Order column, I would like to count the number of consecutive B's for each ID. Ideally the output I would like is
ID, Consec
1, 2
1, 4
because segment B appears consecutively in rows 2 and 3 (2 times), and then again in rows 5, 6, 7 and 8 (4 times).
The loop solution is quite obvious but would also be very slow.
Are there elegant solutions in data.table that are also fast?
P.S. The data I am dealing with has ~20 million rows.
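A minimal sketch to reconstruct the sample data (the name DT is an assumption, chosen to match the answers below):

library(data.table)
DT <- data.table(ID      = rep(1L, 8),
                 Order   = 1:8,
                 Segment = c("A", "B", "B", "C", "B", "B", "B", "B"))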
Try
library(data.table) # v1.9.5+
DT[order(ID, Order)                # sort by ID, then Order
   ][, indx := rleid(Segment)      # id each run of consecutive Segment values
   ][Segment == 'B',               # keep only the runs of B
     list(Consec = .N), by = list(indx, ID)][, indx := NULL][]
# ID Consec
#1: 1 2
#2: 1 4
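The helper column indx comes from rleid() (run-length id), which starts a new id every time the value changes, so each run of identical Segment values shares one id. Applied to the example's Segment column:

rleid(c("A", "B", "B", "C", "B", "B", "B", "B"))
# [1] 1 2 2 3 4 4 4 4

Grouping by indx (and ID) and taking .N then yields the length of each B run.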
Or as @eddi suggested
DT[order(ID, Order)][, .(Consec = .N),
   by = .(ID, Segment, rleid(Segment))   # group by the run id directly
   ][Segment == 'B', .(ID, Consec)]
# ID Consec
#1: 1 2
#2: 1 4
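This variant computes the run id directly inside by, so it avoids creating the temporary indx column and deleting it afterwards.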
A more memory-efficient method is to use setorder instead of order (as suggested by @Arun); setorder sorts the data.table by reference, so no reordered copy is made:
setorder(DT, ID, Order)[, .(Consec = .N), by = .(ID, Segment,
rleid(Segment))][Segment == 'B', .(ID, Consec)]
# ID Consec
#1: 1 2
#2: 1 4
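Given the ~20 million rows mentioned in the question, here is a rough timing sketch on simulated data of a similar scale (the number of IDs and the segment mix are illustrative assumptions, not taken from the question):

set.seed(1)
n  <- 2e7
DT <- data.table(ID      = sample(1e5L, n, replace = TRUE),
                 Order   = sample(n),
                 Segment = sample(c("A", "B", "C"), n, replace = TRUE))
system.time(
  setorder(DT, ID, Order)[, .(Consec = .N),
    by = .(ID, Segment, rleid(Segment))][Segment == "B", .(ID, Consec)]
)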