Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count the number of times an element appears consecutively in a data.table?

Tags:

r

data.table

I have a data.table that looks like this

ID, Order, Segment
1, 1, A
1, 2, B
1, 3, B
1, 4, C
1, 5, B
1, 6, B
1, 7, B
1, 8, B

Basically by ordering the data using the Order column. I would like to understand the number of consecutive B's for each of the ID's. Ideally the output I would like is

ID, Consec
1, 2
1, 4

Because the segment B appears consecutively in row 2 and 3 (2 times), and then again in row 5,6,7,8 (4 times).

The loop solution is quite obvious but would also be very slow.

Are there elegant solutions in data.table that is also fast?

P.S. The data I am dealing with has ~20 million rows.

like image 708
xiaodai Avatar asked Apr 15 '15 04:04

xiaodai


People also ask

How can we get count of the number of records in a table?

Counting all of the Rows in a Table. To counts all of the rows in a table, whether they contain NULL values or not, use COUNT(*). That form of the COUNT() function basically returns the number of rows in a result set returned by a SELECT statement.

What does count (*) do in SQL?

COUNT(*) returns the number of rows in a specified table, and it preserves duplicate rows. It counts each row separately. This includes rows that contain null values.

How do you check for consecutive years in SQL?

SQL Fiddle. The trick is to - per contact - subtract the running count ( row_number() ) from each year. Consecutive rows produce the same group number ( grp ). the number itself has no meaning, it just identifies groups per contact . Then count, sort, get the maximum count.


1 Answers

Try

 library(data.table)#v1.9.5+
  DT[order(ID, Order)][, indx:=rleid(Segment)][Segment=='B',
    list(Consec=.N), by = list(indx, ID)][,indx:=NULL][]

 #    ID Consec
 #1:  1      2
 #2:  1      4

Or as @eddi suggested

 DT[order(ID, Order)][, .(Consec = .N), by = .(ID, Segment, 
              rleid(Segment))][Segment == 'B', .(ID, Consec)]
 #    ID Consec
 #1:  1      2
 #2:  1      4

A more memory efficient method would be to use setorder instead of order (as suggested by @Arun)

  setorder(DT, ID, Order)[, .(Consec = .N), by = .(ID, Segment, 
                rleid(Segment))][Segment == 'B', .(ID, Consec)]
  #   ID Consec
  #1:  1      2
  #2:  1      4
like image 100
akrun Avatar answered Oct 21 '22 10:10

akrun