Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Remove duplicates from a dataframe based on categories in a column

Tags:

r

Here is my example data set:

      Name Course Cateory
 1: Jason     ML      PT
 2: Jason     ML      DI
 3: Jason     ML      GT
 4: Jason     ML      SY
 5: Jason     DS      SY
 6: Jason     DS      DI
 7: Nancy     ML      PT
 8: Nancy     ML      SY
 9: Nancy     DS      DI
10: Nancy     DS      GT
11: James     ML      SY
12:  John     DS      GT

I want to delete the duplicate rows to have unique rows across the dataframe. Deleting the duplicate rows is based on the values from the column category. The preference for values in the category column are given in this order {'PT','DI','GT','SY'}.

My output dataframe looks like below:

  Name Course Cateory
1: Jason     ML      PT
2: Jason     DS      DI
3: Nancy     ML      PT
4: Nancy     DS      DI
5: James     ML      SY
6:  John     DS      GT

Currently, I am using a combination of for loop and if condition. Since the Input dataframe is massive (10 Million rows), it takes forever. Is there a better and efficient way to perform the same?

like image 673
Jason Mathews Avatar asked Nov 28 '17 19:11

Jason Mathews


People also ask

How do I remove duplicates based on criteria?

In Excel, there are several ways to filter for unique values—or remove duplicate values: To filter for unique values, click Data > Sort & Filter > Advanced. To remove duplicate values, click Data > Data Tools > Remove Duplicates.

How do I remove duplicate values from a column in R?

To remove duplicates in R, Use duplicated() method: It identifies the duplicate elements. Using unique() method: It extracts unique elements. dplyr package's distinct() function: Removing duplicate rows from a data frame.

How do I remove duplicates from a column in a DataFrame?

To drop duplicate columns from pandas DataFrame use df. T. drop_duplicates(). T , this removes all columns that have the same data regardless of column names.


2 Answers

Here is a snippet that does what you asked:

df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

df <- df[order(df$Category),]

df[!duplicated(df[,c('Name', 'Course')]),]

Output:

Name Course Category
Jason     ML       PT
Nancy     ML       PT
Jason     DS       DI
Nancy     DS       DI
John      DS       GT
James     ML       SY

Idea is that we sort based on the priority structure. Then we apply the unique operations, which will return the first match. The return will be what we want.

like image 163
chn Avatar answered Oct 10 '22 17:10

chn


Since you mentioned you have 10 million rows, here is a data.table solution:

library(data.table)

setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by=.(Name, Course)]

Result:

    Name Course Category
1: Jason     ML       PT
2: Jason     DS       DI
3: Nancy     ML       PT
4: Nancy     DS       DI
5: James     ML       SY
6:  John     DS       GT

Benchmarking:

# Random resampling of `df` to generate 10 million rows
set.seed(123)
df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))

# Data prep Base R  
df1 <- df_large

df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))

df1 <- df1[order(df1$Category), ]

# Data prep data.table
df2 <- df_large

df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))

setDT(df2)

Results:

library(microbenchmark)
microbenchmark(df1[!duplicated(df1[,c('Name', 'Course')]), ], 
               df2[, .SD[which.min(df2$Category)], by=.(Name, Course)])

Unit: milliseconds
                                                      expr       min        lq      mean
            df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821
 df2[, .SD[which.min(df2$Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381
    median        uq       max neval
 1774.3131 1803.7565 2085.9722   100
  427.6739  451.1776  558.2749   100

Data:

df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L, 
4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"), 
    Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L, 
    2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name", 
"Course", "Category"), class = "data.frame", row.names = c("1:", 
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:", 
"12:"))
like image 32
acylam Avatar answered Oct 10 '22 18:10

acylam