Here is my example data set:
Name Course Category
1: Jason ML PT
2: Jason ML DI
3: Jason ML GT
4: Jason ML SY
5: Jason DS SY
6: Jason DS DI
7: Nancy ML PT
8: Nancy ML SY
9: Nancy DS DI
10: Nancy DS GT
11: James ML SY
12: John DS GT
I want to delete the duplicate rows so that each Name/Course combination appears only once in the dataframe. Which duplicate row to keep is decided by the values in the Category column, with the preference given in this order: {'PT', 'DI', 'GT', 'SY'}.
My output dataframe looks like below:
Name Course Category
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
Currently, I am using a combination of a for loop and an if condition. Since the input dataframe is massive (10 million rows), it takes forever. Is there a better, more efficient way to do the same?
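For illustration, my current approach looks roughly like this (a simplified sketch of the loop-and-condition idea, not my exact code):

priority <- c(PT = 1, DI = 2, GT = 3, SY = 4)
result <- df[0, ]  # empty dataframe with the same columns as df
for (i in seq_len(nrow(df))) {
  # look for an already-kept row with the same Name/Course pair
  j <- which(result$Name == df$Name[i] & result$Course == df$Course[i])
  if (length(j) == 0) {
    result <- rbind(result, df[i, ])   # first time seeing this pair
  } else if (priority[as.character(df$Category[i])] <
             priority[as.character(result$Category[j])]) {
    result[j, ] <- df[i, ]             # higher-priority Category: replace
  }
}

Every iteration rescans the accumulated result and grows it with rbind, so the work is roughly quadratic in the number of rows, which is why it crawls at 10 million rows.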
Here is a snippet that does what you asked:
# Encode the preference order as factor levels, so sorting puts the
# highest-priority Category first
df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))
df <- df[order(df$Category), ]
# Keep only the first (i.e. highest-priority) row per Name/Course pair
df[!duplicated(df[, c('Name', 'Course')]), ]
Output:
Name Course Category
Jason ML PT
Nancy ML PT
Jason DS DI
Nancy DS DI
John DS GT
James ML SY
The idea is to sort by the priority order first, then drop duplicates, which keeps the first match for each Name/Course pair. Because of the sort, that first match is the highest-priority row, which is exactly what we want.
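For what it's worth, the same sort-then-deduplicate idea can also be written with dplyr (a sketch, assuming the dplyr package is installed):

library(dplyr)
df %>%
  mutate(Category = factor(Category, levels = c("PT", "DI", "GT", "SY"))) %>%
  arrange(Category) %>%                      # highest priority first
  distinct(Name, Course, .keep_all = TRUE)   # keep first row per pair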
Since you mentioned you have 10 million rows, here is a data.table solution:
library(data.table)
# Within each Name/Course group, keep the row whose Category comes first
# in the priority order (i.e. has the lowest factor level)
setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by = .(Name, Course)]
Result:
Name Course Category
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
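A related data.table idiom, shown as a sketch for comparison, is to order once by the priority factor and let unique() with its by argument keep the first row per pair, avoiding the per-group .SD subsetting:

library(data.table)
dt <- as.data.table(df)
dt[, Category := factor(Category, levels = c("PT", "DI", "GT", "SY"))]
# sort by priority, then keep the first row per Name/Course pair
unique(dt[order(Category)], by = c("Name", "Course"))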
Benchmarking:
# Random resampling of `df` to generate 10 million rows
set.seed(123)
df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))
# Data prep Base R
df1 <- df_large
df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))
df1 <- df1[order(df1$Category), ]
# Data prep data.table
df2 <- df_large
df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))
setDT(df2)
Results:
library(microbenchmark)
microbenchmark(df1[!duplicated(df1[, c('Name', 'Course')]), ],
               df2[, .SD[which.min(Category)], by = .(Name, Course)])
Unit: milliseconds
                                                   expr       min        lq      mean    median        uq       max neval
         df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821 1774.3131 1803.7565 2085.9722   100
  df2[, .SD[which.min(Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381  427.6739  451.1776  558.2749   100
Data:
df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L,
4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"),
Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L,
2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name",
"Course", "Category"), class = "data.frame", row.names = c("1:",
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:",
"12:"))