I've run into this problem twice in the last two weeks alone, so I figured it's worth a post. I'm trying to identify "runs" inside a data.table
, but I can't figure out an elegant way to do it.
set.seed(2016)
dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
dt <- dt[order(Char, ID)]
ID Char
1: 9 A
2: 10 B
3: 20 C
4: 42 C
5: 2 D
6: 4 D
7: 6 D
8: 18 D
...
Here, I'd like to identify and group rows where the ID is within 2 of the row above/below. Here's my currently ugly solution
# Runs of 2 or more IDs within 2 of each other
dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE]
dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")), InRunStart := TRUE]
dt[InRun == TRUE, RunID := cumsum(InRunStart)]
dt[, c("InRun", "InRunStart") := NULL]
dt
ID Char RunID
1: 9 A 1
2: 10 B 1
3: 20 C NA
4: 42 C NA
5: 2 D 2
6: 4 D 2
7: 6 D 2
8: 18 D NA
...
Is there a better way to do this?
EDIT: It seems there's been some confusion over how I'm defining a "run". To put it more explicitly, row_i and row_i+1 should have the same RunID if and only if their IDs are within a distance of 2.
Modify / Add / Delete columns To modify an existing column, or create a new one, use the := operator. Using the data. table := operator modifies the existing object 'in place', which has the benefit of being memory-efficient. Memory management is an important aspect of data.
All data tables should have a descriptive title which describes what data is being gathered. Data should be arranged in columns (not rows). The top of each column should have the variable name listed with the units used in parentheses.
The data table is perhaps the most basic building block of business intelligence. In its simplest form, it consists of a series of columns and rows that intersect in cells, plus a header row in which the names of the columns are stated, to make the content of the table understandable to the end user.
A table can be read from left to right or from top to bottom. If you read a table across the row, you read the information from left to right. In the Cats and Dogs Table, the number of black animals is 2 + 2 = 4. You'll see that those are the numbers in the row directly to the right of the word 'Black.
I would stop after making this run ID:
dt[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill=ID[1L])) > 2)]
But to get the OP's run ID (which ignores length-one runs), here are a couple ways to go:
dt[duplicated(run_id0) | duplicated(run_id0, fromLast=TRUE), run_id1 := .GRP, by=run_id0 ]
# or
dt[, run_len := .N, by=run_id0 ][ run_len > 1L, run_id2 := .GRP, by=run_id0 ]
Don't know if this is elegant or not, but how about:
dt <- data.table(ID = c(9, 10, 15, 18, 21, 22, 25))
run_ids <- abs(dt[1:(.N-1), ID] - dt[2:.N, ID]) <= 2
run_ids <- c(run_ids[1], run_ids)
foo <- with(rle(run_ids), rep(cumsum(values) * values, lengths))
foo[foo == 0] = foo[which(foo == 0) + 1]
dt[, RunID := foo]
dt[RunID == 0, RunID := NA]
# ID RunID
# 1: 9 1
# 2: 10 1
# 3: 15 NA
# 4: 18 NA
# 5: 21 2
# 6: 22 2
# 7: 25 NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With