Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Keep only the second observation per group in R

Tags:

dataframe

r

dplyr

I have a data frame ordered by id variables ("city"), and I want to keep the second observation of those cities that have more than one observation.

For example, here's an example data set:

city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)

Then we have:

   city value
1     1     3
2     1     5
3     2     7
4     3     8
5     3     2
6     4     5
7     5     4
8     6     2
9     7     3
10    7     2
11    8     3

The ideal outcome would be:

   city value
2     1     5
3     2     7
5     3     2
6     4     5
7     5     4
8     6     2
10    7     2
11    8     3

Any help is appreciated!

like image 880
Miguel Moreira Avatar asked Dec 01 '21 02:12

Miguel Moreira


People also ask

How to count observations by group in R?

How to Count Observations by Group in R Often you may be interested in counting the number of observationsby group in R. Fortunately this is easy to do using the count()function from the dplyrlibrary. This tutorial explains several examples of how to use this function in practice using the following data frame:

How do I keep only the n observations with the highest values?

This example shows how to keep only the N observations with the highest values by group using the functions of the dplyr package. Next, we can use the arrange, desc, group_by, and slice functions to return a tibble containing only the three highest values in each group:

How do I Group and summarize data in R?

Two of the most common tasks that you’ll perform in data analysis are grouping and summarizing data. Fortunately the dplyr package in R allows you to quickly group and summarize data. This tutorial provides a quick guide to getting started with dplyr. Before you can use the functions in the dplyr package, you must first load the package:

How to return N highest data points of each group in R?

In Example 1, I’ll show how to return the N highest data points of each group using the basic installation of the R programming language. For this, we first have to sort our data based on the value column in descending order: As next step, we have to apply the Reduce, rbind, and head functions as shown below:


Video Answer


6 Answers

dplyr

library(dplyr)
mydata %>%
  group_by(city) %>%
  filter(n() == 1L | row_number() == 2L) %>%
  ungroup()
# # A tibble: 8 x 2
#    city value
#   <dbl> <dbl>
# 1     1     5
# 2     2     7
# 3     3     2
# 4     4     5
# 5     5     4
# 6     6     2
# 7     7     2
# 8     8     3

or slightly different

mydata %>%
  group_by(city) %>%
  slice(min(n(), 2)) %>%
  ungroup()

base R

ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
           FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
#  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
mydata[ind,]
#    city value
# 2     1     5
# 3     2     7
# 5     3     2
# 6     4     5
# 7     5     4
# 8     6     2
# 10    7     2
# 11    8     3

data.table

Since you mentioned "is way bigger", you might consider data.table at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)

library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
#     city value
#    <num> <num>
# 1:     1     5
# 2:     2     7
# 3:     3     2
# 4:     4     5
# 5:     5     4
# 6:     6     2
# 7:     7     2
# 8:     8     3
like image 138
r2evans Avatar answered Nov 08 '22 13:11

r2evans


Here is logic that uses pmin() to choose either 2 or 1 depending on the length of the vector of value-values:

aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] )
  city value
1    1     5
2    2     7
3    3     2
4    4     5
5    5     4
6    6     2
7    7     2
8    8     3

The aggregate function delivers vectors of value split on the basis of city-values.

like image 28
IRTFM Avatar answered Nov 08 '22 13:11

IRTFM


You may try

library(dplyr)

mydata %>%
  group_by(city) %>%
  filter(case_when(n()> 1 ~ row_number() == 2,
                   TRUE ~ row_number()== 1)) 

  
   city value
  <dbl> <dbl>
1     1     5
2     2     7
3     3     2
4     4     5
5     5     4
6     6     2
7     7     2
8     8     3
like image 42
Park Avatar answered Nov 08 '22 12:11

Park


Another dplyr solution:

mydata %>% group_by(city) %>%    
  summarize(value=value[pmin(2, n())])

Or:

mydata %>% group_by(city) %>%
  summarize(value=ifelse(n() >= 2, value[2], value[1]))

Both Output:

   city value
  <dbl> <dbl>
1     1     5
2     2     7
3     3     2
4     4     5
5     5     4
6     6     2
7     7     2
8     8     3
like image 45
U12-Forward Avatar answered Nov 08 '22 11:11

U12-Forward


If base R is ok try this:

EDIT (since performance really seems to be important):

Using if as a function, should give a 100-fold speed-up in some cases.

aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )

  city value
1    1     5
2    2     7
3    3     2
4    4     5
5    5     4
6    6     2
7    7     2
8    8     3

Benchmarks

Here're some benchmarks because I was curious. I gathered all solutions and let them run through microbenchmark.

Bottom line is 'if'(cond,T,F) is fastest (22.3% faster than ifelse and 17-times faster than the slowest), followed by ifelse and aggregate(pmin). Keep in mind that the data.table solution only ran on one core. So all speed-up in that package comes from parallelization. No real shocker but interesting nonetheless.

library(microbenchmark)

lengths( mydata )
 city value 
20000 20000

c( class(mydata$value), class(mydata$value) )
[1] "integer" "integer"

microbenchmark("aggr_if_function" = { res <- aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )},
"aggr_ifelse" = { res <- aggregate( value ~ city, mydata, function(x) ifelse(!is.na(x[2]),x[2],x[1]) ) },
"dplyr_filter" = { res <- mydata %>% group_by(city) %>% filter(n() == 1L | row_number() == 2L) %>% ungroup() },
"dplyr_slice" = { res <- mydata %>% group_by(city) %>% slice(min(n(), 2)) %>% ungroup() },
"data.table_single_core" = { res <- DT[, .SD[min(.N, 2),], by = city] },
"aggr_pmin" = { res <- aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] ) },
"dplyr_filter_case_when" = { res <- mydata %>% group_by(city) %>% filter(case_when(n()> 1 ~ row_number() == 2, TRUE ~ row_number()== 1)) },
"group_split_purrr" = { res <- group_split(mydata, city) %>% map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows() }, times=50)

Unit: milliseconds
                   expr       min        lq      mean    median        uq
       aggr_if_function  175.5104  179.3273  184.5157  182.1778  186.8963
            aggr_ifelse  214.5846  220.7074  229.2062  228.0688  234.1087
           dplyr_filter  585.5275  607.7011  643.6320  632.0794  660.8184
            dplyr_slice  713.4047  762.9887  792.7491  780.8475  803.7191
 data.table_single_core 2080.3869 2164.3829 2240.8578 2229.5310 2298.9002
              aggr_pmin  321.5265  330.5491  343.2752  341.7866  352.2880
 dplyr_filter_case_when 3171.4859 3337.1669 3492.6915 3500.7783 3608.1809
      group_split_purrr 1466.4527 1543.2597 1590.9994 1588.0186 1630.5590
       max neval     cld
  212.6006    50 a      
  253.0433    50 a      
 1066.6018    50   c    
 1304.4045    50    d   
 2702.4201    50      f 
  457.3435    50  b     
 4195.0774    50       g
 1786.5310    50     e  
like image 25
Andre Wildberg Avatar answered Nov 08 '22 13:11

Andre Wildberg


Combining group_split and map_if:

library(tidyverse)

city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
value2 <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)

group_split(mydata, city) %>% 
    map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows()
#> # A tibble: 8 × 2
#>    city value
#>   <dbl> <dbl>
#> 1     1     5
#> 2     2     7
#> 3     3     2
#> 4     4     5
#> 5     5     4
#> 6     6     2
#> 7     7     2
#> 8     8     3

Created on 2021-11-30 by the reprex package (v2.0.1)

like image 32
jpdugo17 Avatar answered Nov 08 '22 13:11

jpdugo17