Say I have the following data.table:
library(data.table)
DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0)
Which returns something like:
R Seq
1: 1 0
2: 1 0
3: 0 0
4: 0 0
5: 1 0
---
9996: 1 0
9997: 0 0
9998: 0 0
9999: 0 0
10000: 1 0
I want to generate a sequence (1, 2, 3,..., n) that resets whenever R changes from the previous row. Think of it like I'm counting a streak of random numbers.
So the above would then look like:
R Seq
1: 1 1
2: 1 2
3: 0 1
4: 0 2
5: 1 1
---
9996: 1 5
9997: 0 1
9998: 0 2
9999: 0 3
10000: 1 1
Thoughts?
Here is an option:
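set.seed(1)  # fix the seed so the sample is repeatable
DT <- data.table(R=sample(0:1, 10000, replace=TRUE), Seq=0L)
# group rows by a run counter that increments whenever R changes, then number the rows in each run
DT[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
DT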
We create a counter that increments every time your 0-1 variable changes using cumsum(abs(diff(R))). The c(0, part is needed because diff(R) returns one element fewer than R, so we pad it with a leading 0 to get a vector of the correct length. Then we group on that counter with by. This produces:
R Seq
1: 0 1
2: 0 2
3: 1 1
4: 1 2
5: 0 1
---
9996: 1 1
9997: 0 1
9998: 1 1
9999: 1 2
10000: 1 3
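As a side note, more recent data.table releases (1.9.6 onward, as far as I know) include rleid(), which builds this kind of run identifier directly. Below is a minimal sketch, assuming such a version is installed; DT2 and its five values are made up purely so the result can be checked by eye:

```r
library(data.table)

# Tiny made-up table (not the 10000-row DT from above)
DT2 <- data.table(R = c(1L, 1L, 0L, 0L, 1L))

# rleid(R) yields a new id each time R differs from the previous row;
# seq_len(.N) then counts 1..N within each of those runs
DT2[, Seq := seq_len(.N), by = rleid(R)]

DT2$Seq   # 1 2 1 2 1
```

The grouping idea is the same as the cumsum approach; rleid() just replaces the hand-built counter.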
EDIT: Addressing request for clarification:
Let's look at the computation I'm using in by, broken down into two new columns:
DT[, diff:=c(0, diff(R))]
DT[, cumsum:=cumsum(abs(diff))]
print(DT, topn=10)
Produces:
R Seq diff cumsum
1: 0 1 0 0
2: 0 2 0 0
3: 1 1 1 1
4: 1 2 0 1
5: 0 1 -1 2
6: 1 1 1 3
7: 1 2 0 3
8: 1 3 0 3
9: 1 4 0 3
10: 0 1 -1 4
---
9991: 1 2 0 5021
9992: 1 3 0 5021
9993: 1 4 0 5021
9994: 1 5 0 5021
9995: 0 1 -1 5022
9996: 1 1 1 5023
9997: 0 1 -1 5024
9998: 1 1 1 5025
9999: 1 2 0 5025
10000: 1 3 0 5025
You can see how the cumulative sum of the absolute value of the diff increments by one each time R changes. We can then use that cumsum column to break up the data.table into chunks, and for each chunk, generate a sequence using seq(.N) that counts up to the number of items in the chunk (.N represents exactly that: how many items are in each by group).
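To see the same steps on something small enough to trace by hand, here is a base R sketch on a made-up five-element vector (the names grp and seq_in_run are only for illustration). It mirrors the grouping logic described above, using ave() in place of data.table's by:

```r
# Made-up toy vector standing in for the R column
R <- c(1, 1, 0, 0, 1)

# diff(R) has length(R) - 1 elements, so pad with a leading 0
grp <- cumsum(c(0, abs(diff(R))))
grp          # 0 0 1 1 2

# Count 1..n within each group of identical grp values
seq_in_run <- ave(R, grp, FUN = seq_along)
seq_in_run   # 1 2 1 2 1

# Base R shortcut that should give the same counts from run lengths
sequence(rle(R)$lengths)
```

Each distinct value of grp marks one streak, which is why grouping on it and counting rows gives the resetting sequence.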