Here is my example data set:
Name Course Category
1: Jason ML PT
2: Jason ML DI
3: Jason ML GT
4: Jason ML SY
5: Jason DS SY
6: Jason DS DI
7: Nancy ML PT
8: Nancy ML SY
9: Nancy DS DI
10: Nancy DS GT
11: James ML SY
12: John DS GT
I want to delete the duplicate rows so that each Name/Course combination appears only once in the dataframe. Which duplicate row to keep is decided by the values in the Category column, with the preference given in this order: {'PT', 'DI', 'GT', 'SY'}.
My output dataframe looks like below:
Name Course Category
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
Currently, I am using a combination of a for loop and an if condition. Since the input dataframe is massive (10 million rows), it takes forever. Is there a better, more efficient way to do the same?
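For illustration, my current approach looks roughly like this (a simplified sketch of the loop-and-condition idea, not my exact code):

priority <- c(PT = 1, DI = 2, GT = 3, SY = 4)
result <- df[0, ]  # empty dataframe with the same columns as df
for (i in seq_len(nrow(df))) {
  # look for an already-kept row with the same Name/Course pair
  j <- which(result$Name == df$Name[i] & result$Course == df$Course[i])
  if (length(j) == 0) {
    result <- rbind(result, df[i, ])   # first time seeing this pair
  } else if (priority[as.character(df$Category[i])] <
             priority[as.character(result$Category[j])]) {
    result[j, ] <- df[i, ]             # higher-priority Category: replace
  }
}

Every iteration rescans the accumulated result and grows it with rbind, so the work is roughly quadratic in the number of rows, which is why it crawls at 10 million rows.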
Here is a snippet that does what you asked:
# Encode the preference order as factor levels, so sorting puts the
# highest-priority Category first
df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))
df <- df[order(df$Category), ]
# Keep only the first (i.e. highest-priority) row per Name/Course pair
df[!duplicated(df[, c('Name', 'Course')]), ]
Output:
Name Course Category
Jason ML PT
Nancy ML PT
Jason DS DI
Nancy DS DI
John DS GT
James ML SY
The idea is to sort by the priority order first, then drop duplicates, which keeps the first match for each Name/Course pair. Because of the sort, that first match is the highest-priority row, which is exactly what we want.
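For what it's worth, the same sort-then-deduplicate idea can also be written with dplyr (a sketch, assuming the dplyr package is installed):

library(dplyr)
df %>%
  mutate(Category = factor(Category, levels = c("PT", "DI", "GT", "SY"))) %>%
  arrange(Category) %>%                      # highest priority first
  distinct(Name, Course, .keep_all = TRUE)   # keep first row per pair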
Since you mentioned you have 10 million rows, here is a data.table solution:
library(data.table)
# Within each Name/Course group, keep the row whose Category comes first
# in the priority order (i.e. has the lowest factor level)
setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by = .(Name, Course)]
Result:
Name Course Category
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
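A related data.table idiom, shown as a sketch for comparison, is to order once by the priority factor and let unique() with its by argument keep the first row per pair, avoiding the per-group .SD subsetting:

library(data.table)
dt <- as.data.table(df)
dt[, Category := factor(Category, levels = c("PT", "DI", "GT", "SY"))]
# sort by priority, then keep the first row per Name/Course pair
unique(dt[order(Category)], by = c("Name", "Course"))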
Benchmarking:
# Random resampling of `df` to generate 10 million rows
set.seed(123)
df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))
# Data prep Base R
df1 <- df_large
df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))
df1 <- df1[order(df1$Category), ]
# Data prep data.table
df2 <- df_large
df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))
setDT(df2)
Results:
library(microbenchmark)
microbenchmark(df1[!duplicated(df1[, c('Name', 'Course')]), ],
               df2[, .SD[which.min(Category)], by = .(Name, Course)])
Unit: milliseconds
                                                   expr       min        lq      mean    median        uq       max neval
         df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821 1774.3131 1803.7565 2085.9722   100
  df2[, .SD[which.min(Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381  427.6739  451.1776  558.2749   100
Data:
df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L,
4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"),
Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L,
2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name",
"Course", "Category"), class = "data.frame", row.names = c("1:",
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:",
"12:"))