Subset a data frame based on column entry (or rank)

Tags: r, subset

I have a data.frame as simple as this one:

id group idu  value
1  1     1_1  34
2  1     2_1  23
3  1     3_1  67
4  2     4_2  6
5  2     5_2  24
6  2     6_2  45
1  3     1_3  34
2  3     2_3  67
3  3     3_3  76

from which I want to retrieve a subset containing the first entry of each group; something like:

id group idu value
1  1     1_1 34
4  2     4_2 6
1  3     1_3 34

id is not unique so the approach should not rely on it.

Can I achieve this without loops?

dput() of data:

structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L, 
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3", 
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L, 
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group", 
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))


asked Apr 27 '11 13:04

Paulo E. Cardoso

1 Answer

Using Gavin's million row df:

DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

I think the fastest way is to reorder the data frame and then use duplicated:

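system.time({
  # Reorder rows by group (rows within a group keep their original order)
  DF4 <- DF3[order(DF3$group), ]
  # Keep the first row of each group: duplicated() flags every later repeat
  out2 <- DF4[!duplicated(DF4$group), ]
})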
# user  system elapsed 
# 0.335   0.107   0.441
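
As a minimal sketch, the same two steps can be applied to the nine-row example from the question. The names `df`, `ordered_df` and `first_rows` are assumptions made here for illustration and do not appear in the original post; `idu` is kept as a plain character column rather than a factor.

```r
# Rebuild the question's example data (idu kept as character here)
df <- data.frame(
  id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  idu   = c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2", "1_3", "2_3", "3_3"),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L),
  stringsAsFactors = FALSE
)

# Step 1: order by group; ties stay in their original row order
ordered_df <- df[order(df$group), ]

# Step 2: drop every row whose group value has already been seen
first_rows <- ordered_df[!duplicated(ordered_df$group), ]

print(first_rows)
```

The result holds the rows with idu 1_1, 4_2 and 1_3, which matches the desired output in the question. Nothing here depends on id being unique.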

This compares with 7 seconds for Gavin's fastest lapply + split method on my computer.

When working with data frames, the fastest approach is usually to generate all the indices first and then do a single subset.
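
To make the "indices first, one subset" idea concrete, here is a small sketch on toy data. The names `toy`, `idx`, `by_index` and `by_pieces` are hypothetical, and the split + lapply variant is only a generic contrast written for this illustration, not Gavin's actual code.

```r
toy <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  value = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Index-first: compute all row positions once, then subset once
idx <- which(!duplicated(toy$group))
by_index <- toy[idx, ]

# Piecewise: build one small data frame per group, then bind them together
pieces <- lapply(split(toy, toy$group), function(d) d[1, , drop = FALSE])
by_pieces <- do.call(rbind, pieces)

# Both routes select the same rows; only the row names differ
stopifnot(identical(by_index$value, by_pieces$value))
```

The index-first route touches the data frame once, whereas the piecewise route creates and recombines one data frame per group, which is likely where most of the extra time goes when there are many groups.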

answered Sep 21 '22 20:09

hadley
