I have a data frame ordered by id variables ("city"), and I want to keep the second observation of those cities that have more than one observation.
For example, here's an example data set:
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
Then we have:
city value
1 1 3
2 1 5
3 2 7
4 3 8
5 3 2
6 4 5
7 5 4
8 6 2
9 7 3
10 7 2
11 8 3
The ideal outcome would be:
city value
2 1 5
3 2 7
5 3 2
6 4 5
7 5 4
8 6 2
10 7 2
11 8 3
Any help is appreciated!
How to Count Observations by Group in R Often you may be interested in counting the number of observationsby group in R. Fortunately this is easy to do using the count()function from the dplyrlibrary. This tutorial explains several examples of how to use this function in practice using the following data frame:
This example shows how to keep only the N observations with the highest values by group using the functions of the dplyr package. Next, we can use the arrange, desc, group_by, and slice functions to return a tibble containing only the three highest values in each group:
Two of the most common tasks that you’ll perform in data analysis are grouping and summarizing data. Fortunately the dplyr package in R allows you to quickly group and summarize data. This tutorial provides a quick guide to getting started with dplyr. Before you can use the functions in the dplyr package, you must first load the package:
In Example 1, I’ll show how to return the N highest data points of each group using the basic installation of the R programming language. For this, we first have to sort our data based on the value column in descending order: As next step, we have to apply the Reduce, rbind, and head functions as shown below:
library(dplyr)
mydata %>%
group_by(city) %>%
filter(n() == 1L | row_number() == 2L) %>%
ungroup()
# # A tibble: 8 x 2
# city value
# <dbl> <dbl>
# 1 1 5
# 2 2 7
# 3 3 2
# 4 4 5
# 5 5 4
# 6 6 2
# 7 7 2
# 8 8 3
or slightly different
mydata %>%
group_by(city) %>%
slice(min(n(), 2)) %>%
ungroup()
ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
# [1] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
mydata[ind,]
# city value
# 2 1 5
# 3 2 7
# 5 3 2
# 6 4 5
# 7 5 4
# 8 6 2
# 10 7 2
# 11 8 3
Since you mentioned "is way bigger", you might consider data.table
at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)
library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
# city value
# <num> <num>
# 1: 1 5
# 2: 2 7
# 3: 3 2
# 4: 4 5
# 5: 5 4
# 6: 6 2
# 7: 7 2
# 8: 8 3
Here is logic that uses pmin()
to choose either 2 or 1 depending on the length of the vector of value
-values:
aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] )
city value
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
The aggregate function delivers vectors of value
split on the basis of city
-values.
You may try
library(dplyr)
mydata %>%
group_by(city) %>%
filter(case_when(n()> 1 ~ row_number() == 2,
TRUE ~ row_number()== 1))
city value
<dbl> <dbl>
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
Another dplyr
solution:
mydata %>% group_by(city) %>%
summarize(value=value[pmin(2, n())])
Or:
mydata %>% group_by(city) %>%
summarize(value=ifelse(n() >= 2, value[2], value[1]))
Both Output:
city value
<dbl> <dbl>
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
If base R is ok try this:
EDIT (since performance really seems to be important):
Using if
as a function, should give a 100-fold speed-up in some cases.
aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )
city value
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
Here're some benchmarks because I was curious. I gathered all solutions and let them run through microbenchmark
.
Bottom line is 'if'(cond,T,F)
is fastest (22.3% faster than ifelse
and 17-times faster than the slowest), followed by ifelse
and aggregate(pmin)
. Keep in mind that the data.table
solution only ran on one core. So all speed-up in that package comes from parallelization. No real shocker but interesting nonetheless.
library(microbenchmark)
lengths( mydata )
city value
20000 20000
c( class(mydata$value), class(mydata$value) )
[1] "integer" "integer"
microbenchmark("aggr_if_function" = { res <- aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )},
"aggr_ifelse" = { res <- aggregate( value ~ city, mydata, function(x) ifelse(!is.na(x[2]),x[2],x[1]) ) },
"dplyr_filter" = { res <- mydata %>% group_by(city) %>% filter(n() == 1L | row_number() == 2L) %>% ungroup() },
"dplyr_slice" = { res <- mydata %>% group_by(city) %>% slice(min(n(), 2)) %>% ungroup() },
"data.table_single_core" = { res <- DT[, .SD[min(.N, 2),], by = city] },
"aggr_pmin" = { res <- aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] ) },
"dplyr_filter_case_when" = { res <- mydata %>% group_by(city) %>% filter(case_when(n()> 1 ~ row_number() == 2, TRUE ~ row_number()== 1)) },
"group_split_purrr" = { res <- group_split(mydata, city) %>% map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows() }, times=50)
Unit: milliseconds
expr min lq mean median uq
aggr_if_function 175.5104 179.3273 184.5157 182.1778 186.8963
aggr_ifelse 214.5846 220.7074 229.2062 228.0688 234.1087
dplyr_filter 585.5275 607.7011 643.6320 632.0794 660.8184
dplyr_slice 713.4047 762.9887 792.7491 780.8475 803.7191
data.table_single_core 2080.3869 2164.3829 2240.8578 2229.5310 2298.9002
aggr_pmin 321.5265 330.5491 343.2752 341.7866 352.2880
dplyr_filter_case_when 3171.4859 3337.1669 3492.6915 3500.7783 3608.1809
group_split_purrr 1466.4527 1543.2597 1590.9994 1588.0186 1630.5590
max neval cld
212.6006 50 a
253.0433 50 a
1066.6018 50 c
1304.4045 50 d
2702.4201 50 f
457.3435 50 b
4195.0774 50 g
1786.5310 50 e
Combining group_split
and map_if
:
library(tidyverse)
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
value2 <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
group_split(mydata, city) %>%
map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows()
#> # A tibble: 8 × 2
#> city value
#> <dbl> <dbl>
#> 1 1 5
#> 2 2 7
#> 3 3 2
#> 4 4 5
#> 5 5 4
#> 6 6 2
#> 7 7 2
#> 8 8 3
Created on 2021-11-30 by the reprex package (v2.0.1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With