
Row Referencing in R data.table package

Tags: r, data.table

Say I have the following sample dataset:

library(data.table)
iris <- data.table(iris)[c(1:5,51:55,101:105), list(ID=.I, Species,Sepal.Length)]

Then say that I want to calculate the absolute difference between consecutive rows within a group (in this case, Species).

iris[ , SL.Diff := c(NA,abs(diff(Sepal.Length))) , by = Species]

At this point, I have a dataset that looks like the following:

   ID    Species Sepal.Length SL.Diff
1:  1     setosa          5.1      NA
2:  2     setosa          4.9     0.2
3:  3     setosa          4.7     0.2
4:  4     setosa          4.6     0.1
5:  5     setosa          5.0     0.4
6:  6 versicolor          7.0      NA

Now I want to calculate a new variable Sepal.Length2 that takes on the next row's value if SL.Diff is less than a threshold of 0.3.

iris[ , Sepal.Length2 := ifelse(SL.Diff < 0.3, iris[ID+1]$Sepal.Length, Sepal.Length)]

This works the way I want it to. But what if I want to do the same comparison and take the value of the previous row instead of the next row?

iris[ , Sepal.Length3 := ifelse(SL.Diff < 0.3, iris[ID-1]$Sepal.Length, Sepal.Length)]

Sepal.Length3 does not give the output that I was expecting. Anyone know what I could be doing wrong here?

    ID    Species Sepal.Length SL.Diff Sepal.Length2 Sepal.Length3
 1:  1     setosa          5.1      NA            NA            NA
 2:  2     setosa          4.9     0.2           4.7           4.9
 3:  3     setosa          4.7     0.2           4.6           4.7
 4:  4     setosa          4.6     0.1           5.0           4.6
 5:  5     setosa          5.0     0.4           5.0           5.0
 6:  6 versicolor          7.0      NA            NA            NA
 7:  7 versicolor          6.4     0.6           6.4           6.4
 8:  8 versicolor          6.9     0.5           6.9           6.9
 9:  9 versicolor          5.5     1.4           5.5           5.5
10: 10 versicolor          6.5     1.0           6.5           6.5
11: 11  virginica          6.3      NA            NA            NA
12: 12  virginica          5.8     0.5           5.8           5.8
13: 13  virginica          7.1     1.3           7.1           7.1
14: 14  virginica          6.3     0.8           6.3           6.3
15: 15  virginica          6.5     0.2            NA           5.1
Mike.Gahan asked Aug 01 '14 03:08




3 Answers

Not sure of the speed implications of this, but here's another attempt:

# make a column of the previous values within each group using head()
iris[, S3 := c(NA,head(Sepal.Length,-1)), by=Species]
# overwrite those values not meeting your criteria with the original values
iris[ !(SL.Diff < 0.3), S3 := Sepal.Length]

iris
#    ID    Species Sepal.Length SL.Diff  S3
# 1:  1     setosa          5.1      NA  NA
# 2:  2     setosa          4.9     0.2 5.1
# 3:  3     setosa          4.7     0.2 4.9
# 4:  4     setosa          4.6     0.1 4.7
# 5:  5     setosa          5.0     0.4 5.0
# 6:  6 versicolor          7.0      NA  NA
# 7:  7 versicolor          6.4     0.6 6.4
# 8:  8 versicolor          6.9     0.5 6.9
# 9:  9 versicolor          5.5     1.4 5.5
#10: 10 versicolor          6.5     1.0 6.5
#11: 11  virginica          6.3      NA  NA
#12: 12  virginica          5.8     0.5 5.8
#13: 13  virginica          7.1     1.3 7.1
#14: 14  virginica          6.3     0.8 6.3
#15: 15  virginica          6.5     0.2 6.3
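
The same lag-then-compare idea can also be sketched with data.table's shift(), which shifts a vector within each group and pads the group edges with NA. This is only a sketch under some assumptions: it needs a data.table version that ships shift() (1.9.6 or later), it rebuilds the sample from datasets::iris under the hypothetical name dt, and the column names prev and nxt are made up for illustration.

```r
library(data.table)

# Same 15-row sample as in the question, rebuilt from the base iris data
dt <- as.data.table(datasets::iris)[c(1:5, 51:55, 101:105),
                                    list(Species, Sepal.Length)]
dt[, ID := .I]

# Previous and next value within each Species (NA at the group edges)
dt[, prev := shift(Sepal.Length, type = "lag"),  by = Species]
dt[, nxt  := shift(Sepal.Length, type = "lead"), by = Species]

# Absolute difference to the previous row, as in the question
dt[, SL.Diff := abs(Sepal.Length - prev)]

# Take the neighbour only where the difference is below the threshold;
# an NA difference (first row of a group) yields NA
dt[, S2 := ifelse(SL.Diff < 0.3, nxt,  Sepal.Length)]
dt[, S3 := ifelse(SL.Diff < 0.3, prev, Sepal.Length)]

dt[, list(ID, Species, Sepal.Length, SL.Diff, S2, S3)]
```

Because the shifting happens inside by = Species, a row can never pick up a value from a neighbouring group, which the global ID+1 / ID-1 indexing does not guarantee.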
thelatemail answered Sep 25 '22 23:09


[.data.table evaluates i and j in the scope of the data.table in question.

Therefore

iris[ID+1]$Sepal.Length evaluates ID in the scope of iris (for a second time).
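
A small sketch of that scoping rule, using a made-up three-row table (dt, x and the global variable ID are hypothetical names for illustration only):

```r
library(data.table)

dt <- data.table(ID = 1:3, x = c(10, 20, 30))
ID <- 100L  # a global variable that shares its name with the column

# In j, the column ID is found before the global ID
col_plus_one <- dt[, ID + 1L]
col_plus_one
# [1] 2 3 4

# The nested call evaluates ID in the scope of dt again, so it selects
# rows 2, 3 and 4; row 4 does not exist and comes back as NA
nested <- dt[, dt[ID + 1L]$x]
nested
# [1] 20 30 NA
```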

Your issue really arises because ID-1 creates a 0 index for the first row. R silently drops a 0 index, so the result is one element shorter than expected and no longer lines up with the rows.

a <- c('a','b')
a[0:1]
# [1] "a"
a[1]
# [1] "a"
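
To see how the dropped 0 index leads to the output in the question, here is a minimal base R sketch using the five setosa values (x, id, prev and res are illustrative names). The shortened vector is recycled by ifelse(), so each row ends up compared against its own value rather than the previous one:

```r
x  <- c(5.1, 4.9, 4.7, 4.6, 5.0)   # setosa Sepal.Length from the question
id <- seq_along(x)

# id - 1 is 0 1 2 3 4; the 0 is dropped, leaving only four elements
prev <- x[id - 1]
length(prev)
# [1] 4

# ifelse() recycles `yes` to the length of `test`, so position 2 receives
# prev[2], which is x[2] itself and not the previous row's value
res <- ifelse(c(NA, TRUE, TRUE, TRUE, FALSE), prev, x)
res
# [1]  NA 4.9 4.7 4.6 5.0
```

This reproduces the Sepal.Length3 values for rows 1 to 5 in the question, and the recycling also explains why row 15 picked up 5.1, the value from row 1.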

So, you need to deal better with "known NA values" and implied NA values.

Here is an approach

# calculate the "threshold" column
iris[,thresh := SL.Diff <0.3]
# row index to take the "up" value from: the next row where thresh is TRUE, otherwise the row itself
iris[!is.na(thresh), up := ifelse(thresh, ID+1L,ID)]
# create the column
iris[, S2 := Sepal.Length[up]]
# the same for "down"

iris[!is.na(thresh), down := ifelse(thresh, ID-1L,ID)]
iris[, S3 := Sepal.Length[down]]

iris
#     ID    Species Sepal.Length SL.Diff thresh up  S2 down  S3
#  1:  1     setosa          5.1      NA     NA NA  NA   NA  NA
#  2:  2     setosa          4.9     0.2   TRUE  3 4.7    1 5.1
#  3:  3     setosa          4.7     0.2   TRUE  4 4.6    2 4.9
#  4:  4     setosa          4.6     0.1   TRUE  5 5.0    3 4.7
#  5:  5     setosa          5.0     0.4  FALSE  5 5.0    5 5.0
#  6:  6 versicolor          7.0      NA     NA NA  NA   NA  NA
#  7:  7 versicolor          6.4     0.6  FALSE  7 6.4    7 6.4
#  8:  8 versicolor          6.9     0.5  FALSE  8 6.9    8 6.9
#  9:  9 versicolor          5.5     1.4  FALSE  9 5.5    9 5.5
# 10: 10 versicolor          6.5     1.0  FALSE 10 6.5   10 6.5
# 11: 11  virginica          6.3      NA     NA NA  NA   NA  NA
# 12: 12  virginica          5.8     0.5  FALSE 12 5.8   12 5.8
# 13: 13  virginica          7.1     1.3  FALSE 13 7.1   13 7.1
# 14: 14  virginica          6.3     0.8  FALSE 14 6.3   14 6.3
# 15: 15  virginica          6.5     0.2   TRUE 16  NA   14 6.3
mnel answered Sep 25 '22 23:09


I think dplyr makes this a little easier to express by providing lead() and lag() functions:

library(dplyr)
iris2 <- iris[c(1:5, 51:55, 101:105), c("Species", "Sepal.Length")]
names(iris2) <- c("species", "sepal")
iris2$id <- 1:15

iris2 %>%
  group_by(species) %>%
  mutate(
    thres = abs(sepal - lag(sepal)),
    up =   ifelse(thres < 0.3, lead(sepal), sepal),
    down = ifelse(thres < 0.3, lag(sepal), sepal)
  )
hadley answered Sep 25 '22 23:09