I have a data frame ordered by id variables ("city"), and I want to keep the second observation of those cities that have more than one observation. For example, here's an example data set: <pre class="prettyprint"><code>city <- c(1,1,2,3,3,4,5,6,7,7,8) value <- c(3,5,7,8,2,5,4,2,3,2,3) mydata <- data.frame(city, value) </code></pre> Then we have: <pre class="prettyprint"><code> city value 1 1 3 2 1 5 3 2 7 4 3 8 5 3 2 6 4 5 7 5 4 8 6 2 9 7 3 10 7 2 11 8 3 </code></pre> The ideal outcome would be: <pre class="prettyprint"><code> city value 2 1 5 3 2 7 5 3 2 6 4 5 7 5 4 8 6 2 10 7 2 11 8 3 </code></pre> Any help is appreciated!

You may try <pre class="prettyprint"><code>library(dplyr) mydata %>% group_by(city) %>% filter(case_when(n()> 1 ~ row_number() == 2, TRUE ~ row_number()== 1)) city value <dbl> <dbl> 1 1 5 2 2 7 3 3 2 4 4 5 5 5 4 6 6 2 7 7 2 8 8 3 </code></pre>

Another <code>dplyr</code> solution: <pre class="prettyprint"><code>mydata %>% group_by(city) %>% summarize(value=value[pmin(2, n())]) </code></pre> Or: <pre class="prettyprint"><code>mydata %>% group_by(city) %>% summarize(value=ifelse(n() >= 2, value[2], value[1])) </code></pre> Both Output: <pre class="prettyprint"><code> city value <dbl> <dbl> 1 1 5 2 2 7 3 3 2 4 4 5 5 5 4 6 6 2 7 7 2 8 8 3 </code></pre>

If base R is ok try this: EDIT (since performance really seems to be important): Using <code>if</code> as a function, should give a 100-fold speed-up in some cases. <pre class="prettyprint"><code>aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) ) city value 1 1 5 2 2 7 3 3 2 4 4 5 5 5 4 6 6 2 7 7 2 8 8 3 </code></pre> <h4>Benchmarks</h4> Here're some benchmarks because I was curious. I gathered all solutions and let them run through <code>microbenchmark</code>. Bottom line is <code>'if'(cond,T,F)</code> is fastest (22.3% faster than <code>ifelse</code> and 17-times faster than the slowest), followed by <code>ifelse</code> and <code>aggregate(pmin)</code>. Keep in mind that the <code>data.table</code> solution only ran on one core. So all speed-up in that package comes from parallelization. No real shocker but interesting nonetheless. <pre class="prettyprint"><code>library(microbenchmark) lengths( mydata ) city value 20000 20000 c( class(mydata$value), class(mydata$value) ) [1] "integer" "integer" microbenchmark("aggr_if_function" = { res <- aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )}, "aggr_ifelse" = { res <- aggregate( value ~ city, mydata, function(x) ifelse(!is.na(x[2]),x[2],x[1]) ) }, "dplyr_filter" = { res <- mydata %>% group_by(city) %>% filter(n() == 1L | row_number() == 2L) %>% ungroup() }, "dplyr_slice" = { res <- mydata %>% group_by(city) %>% slice(min(n(), 2)) %>% ungroup() }, "data.table_single_core" = { res <- DT[, .SD[min(.N, 2),], by = city] }, "aggr_pmin" = { res <- aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] ) }, "dplyr_filter_case_when" = { res <- mydata %>% group_by(city) %>% filter(case_when(n()> 1 ~ row_number() == 2, TRUE ~ row_number()== 1)) }, "group_split_purrr" = { res <- group_split(mydata, city) %>% map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows() }, times=50) Unit: milliseconds expr min lq mean median uq aggr_if_function 175.5104 179.3273 184.5157 182.1778 186.8963 aggr_ifelse 214.5846 220.7074 229.2062 228.0688 234.1087 dplyr_filter 585.5275 607.7011 643.6320 632.0794 660.8184 dplyr_slice 713.4047 762.9887 792.7491 780.8475 803.7191 data.table_single_core 2080.3869 2164.3829 2240.8578 2229.5310 2298.9002 aggr_pmin 321.5265 330.5491 343.2752 341.7866 352.2880 dplyr_filter_case_when 3171.4859 3337.1669 3492.6915 3500.7783 3608.1809 group_split_purrr 1466.4527 1543.2597 1590.9994 1588.0186 1630.5590 max neval cld 212.6006 50 a 253.0433 50 a 1066.6018 50 c 1304.4045 50 d 2702.4201 50 f 457.3435 50 b 4195.0774 50 g 1786.5310 50 e </code></pre>

Combining <code>group_split</code> and <code>map_if</code>: <pre class="prettyprint lang-r prettyprint-override"><code>library(tidyverse) city <- c(1,1,2,3,3,4,5,6,7,7,8) value <- c(3,5,7,8,2,5,4,2,3,2,3) value2 <- c(3,5,7,8,2,5,4,2,3,2,3) mydata <- data.frame(city, value) group_split(mydata, city) %>% map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows() #> # A tibble: 8 × 2 #> city value #> <dbl> <dbl> #> 1 1 5 #> 2 2 7 #> 3 3 2 #> 4 4 5 #> 5 5 4 #> 6 6 2 #> 7 7 2 #> 8 8 3 </code></pre> Created on 2021-11-30 by the reprex package (v2.0.1)

Keep only the second observation per group in R

Tags:

dataframe

r

dplyr

I have a data frame ordered by id variables ("city"), and I want to keep the second observation of those cities that have more than one observation.

For example, here's an example data set:

city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)

Then we have:

   city value
1     1     3
2     1     5
3     2     7
4     3     8
5     3     2
6     4     5
7     5     4
8     6     2
9     7     3
10    7     2
11    8     3

The ideal outcome would be:

   city value
2     1     5
3     2     7
5     3     2
6     4     5
7     5     4
8     6     2
10    7     2
11    8     3

Any help is appreciated!

880

asked Dec 01 '21 02:12

Miguel Moreira

Video Answer

6 Answers

dplyr

library(dplyr)
mydata %>%
  group_by(city) %>%
  filter(n() == 1L | row_number() == 2L) %>%
  ungroup()
# # A tibble: 8 x 2
#    city value
#   <dbl> <dbl>
# 1     1     5
# 2     2     7
# 3     3     2
# 4     4     5
# 5     5     4
# 6     6     2
# 7     7     2
# 8     8     3

or slightly different

mydata %>%
  group_by(city) %>%
  slice(min(n(), 2)) %>%
  ungroup()

base R

ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
           FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
#  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
mydata[ind,]
#    city value
# 2     1     5
# 3     2     7
# 5     3     2
# 6     4     5
# 7     5     4
# 8     6     2
# 10    7     2
# 11    8     3

data.table

Since you mentioned "is way bigger", you might consider data.table at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)

library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
#     city value
#    <num> <num>
# 1:     1     5
# 2:     2     7
# 3:     3     2
# 4:     4     5
# 5:     5     4
# 6:     6     2
# 7:     7     2
# 8:     8     3

138

answered Nov 08 '22 13:11

r2evans

Here is logic that uses pmin() to choose either 2 or 1 depending on the length of the vector of value-values:

aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] )
  city value
1    1     5
2    2     7
3    3     2
4    4     5
5    5     4
6    6     2
7    7     2
8    8     3

The aggregate function delivers vectors of value split on the basis of city-values.

answered Nov 08 '22 13:11

IRTFM

You may try

library(dplyr)

mydata %>%
  group_by(city) %>%
  filter(case_when(n()> 1 ~ row_number() == 2,
                   TRUE ~ row_number()== 1)) 

  
   city value
  <dbl> <dbl>
1     1     5
2     2     7
3     3     2
4     4     5
5     5     4
6     6     2
7     7     2
8     8     3

answered Nov 08 '22 12:11

Park

Another dplyr solution:

mydata %>% group_by(city) %>%    
  summarize(value=value[pmin(2, n())])

Or:

mydata %>% group_by(city) %>%
  summarize(value=ifelse(n() >= 2, value[2], value[1]))

Both Output:

   city value
  <dbl> <dbl>
1     1     5
2     2     7
3     3     2
4     4     5
5     5     4
6     6     2
7     7     2
8     8     3

answered Nov 08 '22 11:11

U12-Forward

If base R is ok try this:

EDIT (since performance really seems to be important):

Using if as a function, should give a 100-fold speed-up in some cases.

aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )

  city value
1    1     5
2    2     7
3    3     2
4    4     5
5    5     4
6    6     2
7    7     2
8    8     3

Benchmarks

Here're some benchmarks because I was curious. I gathered all solutions and let them run through microbenchmark.

Bottom line is 'if'(cond,T,F) is fastest (22.3% faster than ifelse and 17-times faster than the slowest), followed by ifelse and aggregate(pmin). Keep in mind that the data.table solution only ran on one core. So all speed-up in that package comes from parallelization. No real shocker but interesting nonetheless.

library(microbenchmark)

lengths( mydata )
 city value 
20000 20000

c( class(mydata$value), class(mydata$value) )
[1] "integer" "integer"

microbenchmark("aggr_if_function" = { res <- aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )},
"aggr_ifelse" = { res <- aggregate( value ~ city, mydata, function(x) ifelse(!is.na(x[2]),x[2],x[1]) ) },
"dplyr_filter" = { res <- mydata %>% group_by(city) %>% filter(n() == 1L | row_number() == 2L) %>% ungroup() },
"dplyr_slice" = { res <- mydata %>% group_by(city) %>% slice(min(n(), 2)) %>% ungroup() },
"data.table_single_core" = { res <- DT[, .SD[min(.N, 2),], by = city] },
"aggr_pmin" = { res <- aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] ) },
"dplyr_filter_case_when" = { res <- mydata %>% group_by(city) %>% filter(case_when(n()> 1 ~ row_number() == 2, TRUE ~ row_number()== 1)) },
"group_split_purrr" = { res <- group_split(mydata, city) %>% map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows() }, times=50)

Unit: milliseconds
                   expr       min        lq      mean    median        uq
       aggr_if_function  175.5104  179.3273  184.5157  182.1778  186.8963
            aggr_ifelse  214.5846  220.7074  229.2062  228.0688  234.1087
           dplyr_filter  585.5275  607.7011  643.6320  632.0794  660.8184
            dplyr_slice  713.4047  762.9887  792.7491  780.8475  803.7191
 data.table_single_core 2080.3869 2164.3829 2240.8578 2229.5310 2298.9002
              aggr_pmin  321.5265  330.5491  343.2752  341.7866  352.2880
 dplyr_filter_case_when 3171.4859 3337.1669 3492.6915 3500.7783 3608.1809
      group_split_purrr 1466.4527 1543.2597 1590.9994 1588.0186 1630.5590
       max neval     cld
  212.6006    50 a      
  253.0433    50 a      
 1066.6018    50   c    
 1304.4045    50    d   
 2702.4201    50      f 
  457.3435    50  b     
 4195.0774    50       g
 1786.5310    50     e

answered Nov 08 '22 13:11

Andre Wildberg

Combining group_split and map_if:

library(tidyverse)

city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
value2 <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)

group_split(mydata, city) %>% 
    map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows()
#> # A tibble: 8 × 2
#>    city value
#>   <dbl> <dbl>
#> 1     1     5
#> 2     2     7
#> 3     3     2
#> 4     4     5
#> 5     5     4
#> 6     6     2
#> 7     7     2
#> 8     8     3

^{Created on 2021-11-30 by the reprex package (v2.0.1)}

answered Nov 08 '22 13:11

jpdugo17

Related questions
                            
                                Check if value == integer(0) in R [duplicate]
                            
                                How to export images of diagrammer in R
                            
                                Sequence of two numbers with decreasing occurrence of one of them
                            
                                Starting Shiny app after password input (with Shinydashboard)
                            
                                ggplot2: how to color a graph by multiple variables
                            
                                Using dplyr and stringr to replace all values starts with
                            
                                R, loading rJava error
                            
                                Remove duplicated elements from list
                            
                                Renaming and Hiding an Exported Rcpp function in an R Package
                            
                                Drawing simple mediation diagram in R
                            
                                Turn off verbose messages when loading tidyverse using library() function
                            
                                How to remove rows where all columns are zero using dplyr pipe
                            
                                gam plots with ggplot
                            
                                In R, how to remove everything before the last slash
                            
                                Apply bold font on specific axis ticks
                            
                                How to merge two true/false variables in R?
                            
                                Python f-string equivalent in R?
                            
                                Looping linear regression output in a data frame in r
                            
                                R - more effective left_join [duplicate]
                            
                                Count how many times a value appears and adding the result to a column

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Keep only the second observation per group in R

Tags:

dataframe

r

dplyr

Miguel Moreira

People also ask

Video Answer

6 Answers

dplyr

base R

data.table

r2evans

IRTFM

Park

U12-Forward

Benchmarks

Andre Wildberg

jpdugo17

Recent Activity

Donate For Us