Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a sequence in a data.table depending on a column

Tags:

r

data.table

Say I have the following data.table:

library(data.table)

DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0)

Which returns something like:

       R Seq
    1: 1   0
    2: 1   0
    3: 0   0
    4: 0   0
    5: 1   0
   ---      
 9996: 1   0
 9997: 0   0
 9998: 0   0
 9999: 0   0
10000: 1   0

I want to generate a sequence (1, 2, 3,..., n) that resets whenever R changes from the previous row. Think of it like I'm counting a streak of random numbers.

So the above would then look like:

       R Seq
    1: 1   1
    2: 1   2
    3: 0   1
    4: 0   2
    5: 1   1
   ---      
 9996: 1   5
 9997: 0   1
 9998: 0   2
 9999: 0   3
10000: 1   2

Thoughts?

like image 415
wizard_draziw Avatar asked Aug 20 '14 22:08

wizard_draziw


People also ask

How to insert sequence number in SQL table?

The syntax to create a sequence in SQL Server (Transact-SQL) is: CREATE SEQUENCE [schema.] sequence_name [ AS datatype ] [ START WITH value ] [ INCREMENT BY value ] [ MINVALUE value | NO MINVALUE ] [ MAXVALUE value | NO MAXVALUE ] [ CYCLE | NO CYCLE ] [ CACHE value | NO CACHE ]; AS datatype.

How to create SEquential number in SQL?

To number rows in a result set, you have to use an SQL window function called ROW_NUMBER() . This function assigns a sequential integer number to each result row.

How to get sequence number in SQL query?

The Rank function can be used to generate a sequential number for each row or to give a rank based on specific criteria. The ranking function returns a ranking value for each row. However, based on criteria more than one row can get the same rank.

How do you store a sequence in a database?

Syntax: CREATE SEQUENCE sequence_name START WITH initial_value INCREMENT BY increment_value MINVALUE minimum value MAXVALUE maximum value CYCLE|NOCYCLE ; sequence_name: Name of the sequence. initial_value: starting value from where the sequence starts.


1 Answers

Here is an option:

set.seed(1)
DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0L)
DT[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
DT

We create a counter that increments every time your 0-1 variable changes using cumsum(abs(diff(R))). The c(0, part is to ensure we get the correct length vector. Then we split by it with by. This produces:

       R Seq
    1: 0   1
    2: 0   2
    3: 1   1
    4: 1   2
    5: 0   1
   ---      
 9996: 1   1
 9997: 0   1
 9998: 1   1
 9999: 1   2
10000: 1   3

EDIT: Addressing request for clarification:

lets look at the computation I'm using in by, broken down into two new columns:

DT[, diff:=c(0, diff(R))]
DT[, cumsum:=cumsum(abs(diff))]
print(DT, topn=10)

Produces:

       R Seq diff cumsum
    1: 0   1    0      0
    2: 0   2    0      0
    3: 1   1    1      1
    4: 1   2    0      1
    5: 0   1   -1      2
    6: 1   1    1      3
    7: 1   2    0      3
    8: 1   3    0      3
    9: 1   4    0      3
   10: 0   1   -1      4
   ---                  
 9991: 1   2    0   5021
 9992: 1   3    0   5021
 9993: 1   4    0   5021
 9994: 1   5    0   5021
 9995: 0   1   -1   5022
 9996: 1   1    1   5023
 9997: 0   1   -1   5024
 9998: 1   1    1   5025
 9999: 1   2    0   5025
10000: 1   3    0   5025

You can see how the cumulative sum of the absolute of the diff increments by one each time R changes. We can then use that cumsum column to break up the data.table into chunks, and for each chunk, generate a sequence using seq(.N) that counts to the number of items in the chunk (.N represents exactly that, how many items in each by group).

like image 196
BrodieG Avatar answered Oct 04 '22 18:10

BrodieG