Say I have the following sample dataset:
library(data.table)
iris <- data.table(iris)[c(1:5,51:55,101:105), list(ID=.I, Species,Sepal.Length)]
Then say that I want to calculate the absolute difference between consecutive rows within a group (in this case, Species).
iris[ , SL.Diff := c(NA,abs(diff(Sepal.Length))) , by = Species]
At this point, I have a dataset that looks like the following:
    ID    Species Sepal.Length SL.Diff
 1:  1     setosa          5.1      NA
 2:  2     setosa          4.9     0.2
 3:  3     setosa          4.7     0.2
 4:  4     setosa          4.6     0.1
 5:  5     setosa          5.0     0.4
 6:  6 versicolor          7.0      NA
Now I want to calculate a new variable Sepal.Length2 that takes on the next row's value if SL.Diff is less than a threshold of 0.3.
iris[ , Sepal.Length2 := ifelse(SL.Diff < 0.3, iris[ID+1]$Sepal.Length, Sepal.Length)]
This works the way I want it to. But what if I want to do the same comparison and take the value of the previous row instead of the next row?
iris[ , Sepal.Length3 := ifelse(SL.Diff < 0.3, iris[ID-1]$Sepal.Length, Sepal.Length)]
Sepal.Length3 does not give the output that I was expecting. Does anyone know what I could be doing wrong here?
    ID    Species Sepal.Length SL.Diff Sepal.Length2 Sepal.Length3
 1:  1     setosa          5.1      NA            NA            NA
 2:  2     setosa          4.9     0.2           4.7           4.9
 3:  3     setosa          4.7     0.2           4.6           4.7
 4:  4     setosa          4.6     0.1           5.0           4.6
 5:  5     setosa          5.0     0.4           5.0           5.0
 6:  6 versicolor          7.0      NA            NA            NA
 7:  7 versicolor          6.4     0.6           6.4           6.4
 8:  8 versicolor          6.9     0.5           6.9           6.9
 9:  9 versicolor          5.5     1.4           5.5           5.5
10: 10 versicolor          6.5     1.0           6.5           6.5
11: 11  virginica          6.3      NA            NA            NA
12: 12  virginica          5.8     0.5           5.8           5.8
13: 13  virginica          7.1     1.3           7.1           7.1
14: 14  virginica          6.3     0.8           6.3           6.3
15: 15  virginica          6.5     0.2            NA           5.1
Not sure of the speed implications of this, but here's another attempt:
# make a column of the previous values within each group using head()
iris[, S3 := c(NA,head(Sepal.Length,-1)), by=Species]
# overwrite those values not meeting your criteria with the original values
iris[ !(SL.Diff < 0.3), S3 := Sepal.Length]
iris
#     ID    Species Sepal.Length SL.Diff  S3
#  1:  1     setosa          5.1      NA  NA
#  2:  2     setosa          4.9     0.2 5.1
#  3:  3     setosa          4.7     0.2 4.9
#  4:  4     setosa          4.6     0.1 4.7
#  5:  5     setosa          5.0     0.4 5.0
#  6:  6 versicolor          7.0      NA  NA
#  7:  7 versicolor          6.4     0.6 6.4
#  8:  8 versicolor          6.9     0.5 6.9
#  9:  9 versicolor          5.5     1.4 5.5
# 10: 10 versicolor          6.5     1.0 6.5
# 11: 11  virginica          6.3      NA  NA
# 12: 12  virginica          5.8     0.5 5.8
# 13: 13  virginica          7.1     1.3 7.1
# 14: 14  virginica          6.3     0.8 6.3
# 15: 15  virginica          6.5     0.2 6.3
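For what it's worth, more recent data.table releases ship a shift() function that does the head()/NA padding for you. Below is a minimal sketch of the same idea, not a benchmarked alternative. It rebuilds the sample data from the built-in iris data frame so it runs on its own, and the name dt is only for illustration:

```r
library(data.table)

# rebuild the question's sample rows from the built-in data frame
dt <- as.data.table(datasets::iris)[c(1:5, 51:55, 101:105), .(Species, Sepal.Length)]
dt[, SL.Diff := c(NA, abs(diff(Sepal.Length))), by = Species]

# shift() defaults to a lag of 1 and pads with NA, so this is the "previous row" case
dt[, S3 := ifelse(SL.Diff < 0.3, shift(Sepal.Length), Sepal.Length), by = Species]

# type = "lead" looks at the next row within the group instead
dt[, S2 := ifelse(SL.Diff < 0.3, shift(Sepal.Length, type = "lead"), Sepal.Length), by = Species]
```

Because both calls use by = Species, the first row of each group has no previous value and the last row has no next value, so those positions come out as NA rather than borrowing from a neighbouring group.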
[.data.table evaluates i and j in the scope of the data.table in question. Therefore iris[ID+1]$Sepal.Length evaluates ID in the scope of iris (for a second time).
Your issue really arises because ID-1 creates a 0 index for the first row, and a 0 index is silently dropped by R:
a <- c('a','b')
a[0:1]
# [1] "a"
a[1]
# [1] "a"
So, you need to deal better with "known NA values" and implied NA values.
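To see how that plays out in the question, here is a small base R sketch using just the five setosa values (x, id, prev_bad and prev_ok are illustrative names, not from the original code). The zero produced by id - 1 is dropped, so the result is one element short, and ifelse() then recycles it against the longer test vector. Judging by the output in the question, data.table's i handles the zero the same way:

```r
x  <- c(5.1, 4.9, 4.7, 4.6, 5.0)   # the setosa Sepal.Length values
id <- seq_along(x)

# the index is 0:4; the 0 is dropped, leaving x[1:4]
prev_bad <- x[id - 1]
length(prev_bad)                    # 4, not 5

# position k of prev_bad holds x[k], i.e. the row's own value,
# which is why Sepal.Length3 just repeats Sepal.Length

# padding explicitly keeps the length at 5 and really is the previous value
prev_ok <- c(NA, head(x, -1))
```

That also accounts for the odd 5.1 in row 15 of the question's output: the 14 values returned by iris[ID-1] get recycled to length 15, so the last position wraps around to the first row's value.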
Here is an approach
# calculate the "threshold" column
iris[,thresh := SL.Diff <0.3]
# where it needs to go "up", and which row index it should take the value from
iris[!is.na(thresh), up := ifelse(thresh, ID+1L,ID)]
# create the column
iris[, S2 := Sepal.Length[up]]
# the same for "down"
iris[!is.na(thresh), down := ifelse(thresh, ID-1L,ID)]
iris[, S3 := Sepal.Length[down]]
iris
#     ID    Species Sepal.Length SL.Diff thresh up  S2 down  S3
#  1:  1     setosa          5.1      NA     NA NA  NA   NA  NA
#  2:  2     setosa          4.9     0.2   TRUE  3 4.7    1 5.1
#  3:  3     setosa          4.7     0.2   TRUE  4 4.6    2 4.9
#  4:  4     setosa          4.6     0.1   TRUE  5 5.0    3 4.7
#  5:  5     setosa          5.0     0.4  FALSE  5 5.0    5 5.0
#  6:  6 versicolor          7.0      NA     NA NA  NA   NA  NA
#  7:  7 versicolor          6.4     0.6  FALSE  7 6.4    7 6.4
#  8:  8 versicolor          6.9     0.5  FALSE  8 6.9    8 6.9
#  9:  9 versicolor          5.5     1.4  FALSE  9 5.5    9 5.5
# 10: 10 versicolor          6.5     1.0  FALSE 10 6.5   10 6.5
# 11: 11  virginica          6.3      NA     NA NA  NA   NA  NA
# 12: 12  virginica          5.8     0.5  FALSE 12 5.8   12 5.8
# 13: 13  virginica          7.1     1.3  FALSE 13 7.1   13 7.1
# 14: 14  virginica          6.3     0.8  FALSE 14 6.3   14 6.3
# 15: 15  virginica          6.5     0.2   TRUE 16  NA   14 6.3
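One thing to watch with this approach, as far as I can tell: up and down index the whole Sepal.Length column by ID, so nothing stops the last row of a group from picking up the first value of the next group. It doesn't show in the sample data, but a toy example in base R makes the difference from a group-wise lead visible (g, x and the other names are made up for illustration):

```r
g  <- c("a", "a", "a", "b", "b")
x  <- c(1.0, 2.0, 2.1, 5.0, 6.0)
id <- seq_along(x)

# absolute difference to the previous row within each group
d      <- ave(x, g, FUN = function(v) c(NA, abs(diff(v))))
thresh <- d < 0.3

# whole-table index: row 3 (last of group "a") reaches into group "b"
up_global <- x[ifelse(thresh, id + 1L, id)]

# group-wise lead: the last row of each group has no next value
lead_grp   <- ave(x, g, FUN = function(v) c(v[-1], NA))
up_grouped <- ifelse(thresh, lead_grp, x)
```

Here up_global gives 5.0 for row 3, taken from group "b", while up_grouped gives NA there. If crossing group boundaries is not what you want, computing the shifted column with by = Species, as in the head() answer, avoids it.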
I think dplyr makes this a little easier to express by providing lead() and lag() functions:
library(dplyr)
iris2 <- iris[c(1:5, 51:55, 101:105), c("Species", "Sepal.Length")]
names(iris2) <- c("species", "sepal")
iris2$id <- 1:15
iris2 %>%
  group_by(species) %>%
  mutate(
    thres = abs(sepal - lag(sepal)),
    up = ifelse(thres < 0.3, lead(sepal), sepal),
    down = ifelse(thres < 0.3, lag(sepal), sepal)
  )