How do I select the first row of an R data frame that meets certain criteria?
Here is the context:
I have a data frame with five columns: "pixel", "year", "propvar", "component", and "cumsum".
There are 1,225 combinations of pixel and year, because the data were computed from the annual time series of 49 geographic pixels for each of 25 study years. Within each pixel-year, I have computed propvar, the proportion of total variance explained by a given component of the fast Fourier transform for the time series of a given pixel-year. I then computed cumsum, which is the cumulative sum of propvar for each frequency component within a pixel-year. The component column just gives you an index for the Fourier series component (plus 1) from which propvar was calculated.
I want to determine the number of components required to explain greater than 99% of the variance. I figure one way to do this is to find the first row within each pixel-year where cumsum > 0.99, and create a data frame from it with three columns, pixel, year, and numbercomps, where numbercomps is the number of components required within a given pixel-year to explain greater than 99% of the variance. I do not know how to do this in R. Does anyone have a solution?
Using subset(): base R also provides a subset() function that selects rows based on a logical condition on a column.
Using bracket notation on an R data frame, we can select rows by column value, by index, by name, by condition, etc. The base R function subset() gives the same results. In addition, the dplyr package provides filter() for selecting rows from a data frame.
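As a minimal sketch of the two base R forms mentioned above, assuming a small made-up data frame called scores (the name and values are illustrative only, not from the question's data):

```r
# Hypothetical data for illustration only
scores <- data.frame(id = 1:5,
                     cumsum = c(0.50, 0.80, 0.95, 0.995, 1.00))

# Bracket notation: logical condition in the row position,
# empty column position keeps all columns
by_bracket <- scores[scores$cumsum > 0.99, ]

# subset(): the condition is evaluated inside the data frame,
# so the column can be named directly
by_subset <- subset(scores, cumsum > 0.99)

# Both approaches select the same rows
by_bracket
by_subset
```

One practical difference: subset() silently drops rows where the condition is NA, whereas bracket notation with a logical vector returns an all-NA row for each NA in the condition.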
To choose the first row of each group in R, the dplyr package can be used: group the data, sort it with arrange(), and keep the first row of each group. arrange() sorts in ascending order by default; wrapping a column in desc() sorts it in descending order instead.
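A rough sketch of that dplyr approach, applied to the kind of data described in the question. The data frame fft_df and its values are made up for illustration, and treating the component index of the first qualifying row as numbercomps is an assumption about what the question intends:

```r
library(dplyr)

# Hypothetical data: two pixel-years with three components each
fft_df <- data.frame(
  pixel     = c("a", "a", "a", "b", "b", "b"),
  year      = c(2001, 2001, 2001, 2001, 2001, 2001),
  component = c(1, 2, 3, 1, 2, 3),
  cumsum    = c(0.90, 0.995, 1.00, 0.60, 0.95, 1.00)
)

first_rows <- fft_df %>%
  filter(cumsum > 0.99) %>%              # keep rows past the threshold
  arrange(pixel, year, component) %>%    # ascending order by default
  group_by(pixel, year) %>%
  slice(1) %>%                           # first remaining row of each group
  ungroup() %>%
  select(pixel, year, numbercomps = component)

first_rows
```

Filtering before slicing matters here: taking the first row of each group and then filtering would only ever inspect component 1.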
Select rows by a list of column values: with the same bracket notation, the %in% operator selects data frame rows whose column value appears in a vector of values. For example, df[df$state %in% c('CA','AZ','PH'), ] returns all rows whose state value is present in the vector c('CA','AZ','PH').
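A small self-contained illustration of %in%, using a made-up data frame people (the names and states are hypothetical):

```r
# Hypothetical data frame with a state column
people <- data.frame(name  = c("Ann", "Bob", "Cy", "Di"),
                     state = c("CA", "NY", "AZ", "TX"))

# Keep rows whose state appears in the vector of wanted values;
# values in the vector that never occur in the data (here "PH") are simply ignored
wanted <- c("CA", "AZ", "PH")
in_wanted <- people[people$state %in% wanted, ]
in_wanted
```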
Sure. Something like this should do the trick:
# CREATE A REPRODUCIBLE EXAMPLE!
df <- data.frame(year = c("2001", "2003", "2001", "2003", "2003"),
                 pixel = c("a", "b", "a", "b", "a"),
                 cumsum = c(99, 99, 98, 99, 99),
                 numbercomps = 1:5)
df
# year pixel cumsum numbercomps
# 1 2001 a 99 1
# 2 2003 b 99 2
# 3 2001 a 98 3
# 4 2003 b 99 4
# 5 2003 a 99 5
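# EXTRACT THE SUBSET YOU'D LIKE.
# Keep rows at or above the threshold, then keep only the first row per year-pixel.
# (This relies on the rows already being in component order within each pixel-year.)
res <- subset(df, cumsum >= 99)
res <- subset(res,
              subset = !duplicated(res[c("year", "pixel")]),
              select = c("pixel", "year", "numbercomps"))
res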
# pixel year numbercomps
# 1 a 2001 1
# 2 b 2003 2
# 5 a 2003 5
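For the data as described in the question, the same idea might look like the sketch below. It assumes cumsum is on a 0 to 1 scale, uses a strict > 0.99 threshold, sorts explicitly rather than trusting row order, and takes the component index of the first qualifying row as numbercomps. The data frame fft_df and its values are invented for illustration.

```r
# Hypothetical data on the question's 0-1 scale, deliberately out of order
fft_df <- data.frame(
  pixel     = c("a", "a", "a", "b", "b", "b"),
  year      = c(2001, 2001, 2001, 2001, 2001, 2001),
  component = c(3, 1, 2, 1, 2, 3),
  cumsum    = c(1.00, 0.90, 0.995, 0.60, 0.95, 1.00)
)

# Sort so that components are in increasing order within each pixel-year
ordered <- fft_df[order(fft_df$pixel, fft_df$year, fft_df$component), ]

# Keep rows past the threshold, then the first such row per pixel-year
above <- ordered[ordered$cumsum > 0.99, ]
first <- above[!duplicated(above[c("pixel", "year")]), ]

result <- data.frame(pixel = first$pixel,
                     year = first$year,
                     numbercomps = first$component)
result
```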
EDIT: Also, for those interested in data.table, there is this:
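library(data.table)
# Key columns passed as a character vector, so no comma-separated string has to be parsed
dt <- data.table(df, key = c("pixel", "year"))
# First row at or above the threshold within each pixel-year
dt[cumsum >= 99, .SD[1], by = key(dt)]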