Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Deleting rows in R conditionally

I have a data where the first column is a bunch of ID numbers (some repeat), and the second column is just a bunch of numbers. I need a way to keep each ID number only once based on the smallest number in the second column.

Row#   ID   Number
1      10     180
2      12     167
3      12     182
4      12     135
5      15     152
6      15     133

Ex: I only want to keep Row# 1, 4, and 6 here and delete the rest

like image 874
Swanny Avatar asked Jun 15 '15 17:06

Swanny


2 Answers

For selecting the row that has the minimum 'Number' for each 'ID' group, we can use one of the aggregating by group function. A base R option is aggregate. With aggregate, we can either use the 'formula' method or specify a list of grouping elements/variables with the by argument. Using the formula method, we get the min value of 'Number' for each 'ID'.

aggregate(Number~ID, df1, FUN=min)

Or we can use a faster option with data.table. Here, we convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', we get the min value of "Number".

library(data.table)
setDT(df1)[, list(Number=min(Number)), by = ID] 

Or this can be also done with setorder to order the 'Number' column and use unique with by option to select the first non-duplicated 'ID' row. (from @David Arenburgs' comments)

 unique(setorder(setDT(df1), Number), by = "ID")

Or using dplyr, we group by 'ID' and get the subset rows with summarise.

library(dplyr)
df1 %>%
   group_by(ID) %>%
   summarise(Number= min(Number))

Or we can use sqldf syntax to get the subset of data.

library(sqldf)
sqldf('select ID,
        min(Number) as Number
        from df1 
        group by ID')

Update

If there are multiple columns and you want to get the row based on the minimum value of 'Number' for each 'ID', you can use which.min. Using .I will get the row index and that can be used for subsetting the rows.

setDT(df1)[df1[,  .I[which.min(Number)], by = ID]$V1]

Or with dplyr we use slice to filter out the rows that have the min value of 'Number' for each 'ID'

df1 %>% 
    group_by(ID) %>%
    slice(which.min(Number))
like image 193
akrun Avatar answered Sep 19 '22 02:09

akrun


Or

do.call(rbind, 
lapply(split(df1, df1$ID), function(x) subset(x, Number == min(Number))))

#ID Number
#10 10    180
#12 12    135
#15 15    133
like image 29
Veerendra Gadekar Avatar answered Sep 19 '22 02:09

Veerendra Gadekar