I have a data.frame as simple as this one:
id group idu value
1 1 1_1 34
2 1 2_1 23
3 1 3_1 67
4 2 4_2 6
5 2 5_2 24
6 2 6_2 45
1 3 1_3 34
2 3 2_3 67
3 3 3_3 76
from which I want to retrieve a subset containing the first entry of each group; something like:
id group idu value
1 1 1_1 34
4 2 4_2 6
1 3 1_3 34
id is not unique so the approach should not rely on it.
Can I achieve this avoiding loops?
dput() of the data:
structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L,
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3",
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L,
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group",
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))
How do you subset a data frame by column value or by column name in R? With base R's df[] bracket notation or subset(), you can select rows by column value and columns by name.
Bracket notation on a data frame lets you select rows by column value, by index, by name, or by condition. The base function subset() gives the same results, and the dplyr package provides filter() for selecting rows as well.
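As a rough illustration of the base R selection styles mentioned above, here is a minimal sketch. The small data frame and the variable names (by_bracket, by_subset, by_index, by_name) are made up for this example and are not part of the question's data.

```r
# Hypothetical toy data, only for illustrating the selection styles.
df <- data.frame(
  id    = c(1L, 2L, 3L, 4L),
  group = c(1L, 1L, 2L, 2L),
  value = c(34L, 23L, 6L, 24L)
)

# Rows by column value, using bracket notation with a logical condition.
by_bracket <- df[df$group == 2L, ]

# The same rows using subset(); column names are evaluated inside df.
by_subset <- subset(df, group == 2L)

# Rows by position (first and third row).
by_index <- df[c(1, 3), ]

# Columns by name.
by_name <- df[, c("id", "value")]
```

Note that subset() is convenient interactively, while bracket notation is generally the safer choice inside functions.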
Using Gavin's million row df:
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
group = factor(rep(1:1000, each = 1000)),
value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
I think the fastest way is to reorder the data frame and then use duplicated():
system.time({
DF4 <- DF3[order(DF3$group), ]
out2 <- DF4[!duplicated(DF4$group), ]
})
# user system elapsed
# 0.335 0.107 0.441
This compares to 7 seconds for Gavin's fastest lapply + split method on my computer.
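The lapply + split code being compared against is not shown here. For context, a split-then-lapply approach might look roughly like the sketch below. This is an assumption about the general shape of such a method, not necessarily the code that was timed, and it uses the question's small data rather than the million-row one.

```r
# The question's data, rebuilt by hand (idu as a factor, as in the dput()).
df <- data.frame(
  id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = rep(1:3, each = 3),
  idu   = factor(c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2",
                   "1_3", "2_3", "3_3")),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)
)

# Split into one data frame per group, take the first row of each piece,
# then bind the one-row pieces back together.
pieces <- split(df, df$group)
firsts <- lapply(pieces, function(piece) piece[1, , drop = FALSE])
out_split <- do.call(rbind, firsts)
```

Each group produces its own small data frame here, which is the kind of per-group overhead that the single-subset approach avoids.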
When working with data frames, the fastest approach is usually to generate all the indices first and then do a single subset.
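Applied to the small data frame from the question, the "build the index, then subset once" idea can be sketched as follows. The names first_of_group and first_rows are just illustrative, and the order() step is skipped on the assumption that the rows are already grouped, as they are in the question.

```r
# The question's data, rebuilt by hand (idu as a factor, as in the dput()).
df <- data.frame(
  id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = rep(1:3, each = 3),
  idu   = factor(c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2",
                   "1_3", "2_3", "3_3")),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)
)

# Step 1: build the whole logical index in one vectorised call.
# duplicated() is FALSE the first time each group value appears.
first_of_group <- !duplicated(df$group)

# Step 2: one single subset using that index.
first_rows <- df[first_of_group, ]
print(first_rows)
```

This does not rely on id being unique, since only the group column is used to build the index.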