Elegant way to identify runs inside a data.table


I've run into this problem twice in the last two weeks alone, so I figured it's worth a post. I'm trying to identify "runs" inside a data.table, but I can't figure out an elegant way to do it.

Example

library(data.table)
set.seed(2016)
dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
dt <- dt[order(Char, ID)]

    ID Char
 1:  9    A
 2: 10    B
 3: 20    C
 4: 42    C
 5:  2    D
 6:  4    D
 7:  6    D
 8: 18    D
 ...

Here, I'd like to identify and group adjacent rows whose IDs are within 2 of each other. Here's my current, ugly solution:

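# Runs of 2 or more IDs within 2 of each other
dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
# A row is in a run if it is within 2 of the row above or the row below
dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE]
# A run starts at an in-run row whose row above is missing or more than 2 away
dt[InRun == TRUE & (abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag"))), InRunStart := TRUE]
# Number the runs, then drop the helper columns
dt[InRun == TRUE, RunID := cumsum(InRunStart)]
dt[, c("InRun", "InRunStart") := NULL]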
dt
    ID Char RunID
 1:  9    A     1
 2: 10    B     1
 3: 20    C    NA
 4: 42    C    NA
 5:  2    D     2
 6:  4    D     2
 7:  6    D     2
 8: 18    D    NA
 ...

Is there a better way to do this?

EDIT: It seems there's been some confusion over how I'm defining a "run". To put it more explicitly, row_i and row_i+1 should have the same RunID if and only if their IDs are within a distance of 2.


asked Nov 15 '16 01:11

Ben

2 Answers

I would stop after making this run ID:

dt[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill=ID[1L])) > 2)]

But to get the OP's run ID (which ignores length-one runs), here are a couple of ways to go:

dt[duplicated(run_id0) | duplicated(run_id0, fromLast=TRUE), run_id1 := .GRP, by=run_id0 ]
# or
dt[, run_len := .N, by=run_id0 ][ run_len > 1L, run_id2 := .GRP, by=run_id0 ]
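
To see what these columns look like, here is a minimal sketch that applies the same steps to the first eight rows shown in the question. The table name `dt8` is made up for illustration and the rows are typed in by hand rather than sampled.

```r
library(data.table)

# First eight rows of the question's example, entered by hand (dt8 is a hypothetical name)
dt8 <- data.table(
  ID   = c(9L, 10L, 20L, 42L, 2L, 4L, 6L, 18L),
  Char = c("A", "B", "C", "C", "D", "D", "D", "D")
)

# Start a new run whenever the gap to the previous row exceeds 2
dt8[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill = ID[1L])) > 2)]

# Keep a run ID only for runs that contain more than one row
dt8[, run_len := .N, by = run_id0][run_len > 1L, run_id2 := .GRP, by = run_id0]

dt8[, .(ID, Char, run_id0, run_id2)]
#    ID Char run_id0 run_id2
# 1:  9    A       1       1
# 2: 10    B       1       1
# 3: 20    C       2      NA
# 4: 42    C       3      NA
# 5:  2    D       4       2
# 6:  4    D       4       2
# 7:  6    D       4       2
# 8: 18    D       5      NA
```

On these rows `run_id2` reproduces the `RunID` column from the question, while `run_id0` also numbers the length-one runs.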


answered Sep 24 '22 16:09

Frank

Don't know if this is elegant or not, but how about:

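library(data.table)
dt <- data.table(ID = c(9, 10, 15, 18, 21, 22, 25))
# TRUE where a row is within 2 of the next row
run_ids <- abs(dt[1:(.N-1), ID] - dt[2:.N, ID]) <= 2
# Pad to .N values: entry i now says whether row i is within 2 of row i-1 (entry 1 repeats entry 2)
run_ids <- c(run_ids[1], run_ids)
# Number each stretch of TRUEs; FALSE stretches get 0
foo <- with(rle(run_ids), rep(cumsum(values) * values, lengths))
# A 0 row takes the next row's value, so the first row of each run picks up its ID
# (looking past the last row yields NA)
foo[foo == 0] <- foo[which(foo == 0) + 1]
dt[, RunID := foo]
dt[RunID == 0, RunID := NA]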
#    ID RunID
# 1:  9     1
# 2: 10     1
# 3: 15    NA
# 4: 18    NA
# 5: 21     2
# 6: 22     2
# 7: 25    NA
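
The same idea can also be written without `rle`. The following is only a sketch in base R: `label_runs` and its `tol` argument are hypothetical names, not part of data.table, and it assumes the input is already in the row order you care about.

```r
# Hypothetical helper: label runs of adjacent values that lie within `tol`
# of each other; values that belong to no run get NA.
label_runs <- function(x, tol = 2) {
  n <- length(x)
  if (n < 2L) return(rep(NA_integer_, n))   # fewer than two rows: no runs possible
  linked <- abs(diff(x)) <= tol              # is row i within tol of row i + 1?
  in_run <- c(FALSE, linked) | c(linked, FALSE)
  starts <- in_run & !c(FALSE, linked)       # in a run but not linked to the row above
  ids <- cumsum(starts)
  ids[!in_run] <- NA_integer_
  ids
}

label_runs(c(9, 10, 15, 18, 21, 22, 25))
# [1]  1  1 NA NA  2  2 NA
```

Inside a data.table this could be used as `dt[, RunID := label_runs(ID)]`. Unlike the version above, it returns early for inputs with fewer than two rows instead of indexing with `1:(.N-1)`.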

answered Sep 23 '22 16:09

John Smith
