Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elegant way to identify runs inside a data.table

Tags:

I've run into this problem twice in the last two weeks alone, so I figured it's worth a post. I'm trying to identify "runs" inside a data.table, but I can't figure out an elegant way to do it.

Example

set.seed(2016)
dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
dt <- dt[order(Char, ID)]

    ID Char
 1:  9    A
 2: 10    B
 3: 20    C
 4: 42    C
 5:  2    D
 6:  4    D
 7:  6    D
 8: 18    D
 ...

Here, I'd like to identify and group rows where the ID is within 2 of the row above/below. Here's my currently ugly solution

# Runs of 2 or more IDs within 2 of each other
dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE]
dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")), InRunStart := TRUE]
dt[InRun == TRUE, RunID := cumsum(InRunStart)]
dt[, c("InRun", "InRunStart") := NULL]
dt
    ID Char RunID
 1:  9    A     1
 2: 10    B     1
 3: 20    C    NA
 4: 42    C    NA
 5:  2    D     2
 6:  4    D     2
 7:  6    D     2
 8: 18    D    NA
 ...

Is there a better way to do this?


EDIT: It seems there's been some confusion over how I'm defining a "run". To put it more explicitly, row_i and row_i+1 should have the same RunID if and only if their IDs are within a distance of 2.

like image 532
Ben Avatar asked Nov 15 '16 01:11

Ben


People also ask

What is := in data table?

Modify / Add / Delete columns To modify an existing column, or create a new one, use the := operator. Using the data. table := operator modifies the existing object 'in place', which has the benefit of being memory-efficient. Memory management is an important aspect of data.

What should be included in a data table?

All data tables should have a descriptive title which describes what data is being gathered. Data should be arranged in columns (not rows). The top of each column should have the variable name listed with the units used in parentheses.

How do you describe a data table?

The data table is perhaps the most basic building block of business intelligence. In its simplest form, it consists of a series of columns and rows that intersect in cells, plus a header row in which the names of the columns are stated, to make the content of the table understandable to the end user.

How do you read a data table?

A table can be read from left to right or from top to bottom. If you read a table across the row, you read the information from left to right. In the Cats and Dogs Table, the number of black animals is 2 + 2 = 4. You'll see that those are the numbers in the row directly to the right of the word 'Black.


2 Answers

I would stop after making this run ID:

dt[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill=ID[1L])) > 2)]

But to get the OP's run ID (which ignores length-one runs), here are a couple ways to go:

dt[duplicated(run_id0) | duplicated(run_id0, fromLast=TRUE), run_id1 := .GRP, by=run_id0 ]
# or
dt[, run_len := .N, by=run_id0 ][ run_len > 1L, run_id2 := .GRP, by=run_id0 ]
like image 194
Frank Avatar answered Sep 24 '22 16:09

Frank


Don't know if this is elegant or not, but how about:

dt <- data.table(ID = c(9, 10, 15, 18, 21, 22, 25))
run_ids <- abs(dt[1:(.N-1), ID] - dt[2:.N, ID]) <= 2
run_ids <- c(run_ids[1], run_ids)
foo <- with(rle(run_ids), rep(cumsum(values) * values, lengths))
foo[foo == 0] = foo[which(foo == 0) + 1]
dt[, RunID := foo]
dt[RunID == 0, RunID := NA]
#    ID RunID
# 1:  9     1
# 2: 10     1
# 3: 15    NA
# 4: 18    NA
# 5: 21     2
# 6: 22     2
# 7: 25    NA
like image 45
John Smith Avatar answered Sep 23 '22 16:09

John Smith