How can I remove all duplicates so that NONE are left in a data frame?

There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.

I have this data frame with 10 rows and 50 columns, where some of the rows are absolutely identical. If I use unique on it, I get one row per - let's say - "type", but what I actually want is to get only those rows which only appear once. Does anyone know how I can achieve this?

I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.

Lilith-Elina asked Dec 07 '12 12:12





3 Answers

This will extract the rows which appear only once (assuming your data frame is named df):

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]

How it works: The function duplicated scans the rows from the first to the last and returns TRUE for every row that is identical to an earlier row, so the first occurrence of each row is not flagged. With the argument fromLast = TRUE, the scan runs from the last row to the first, so every row that is identical to a later row is flagged instead.

Both logical vectors are combined with | (logical 'or') into a new vector that is TRUE for every row appearing more than once, including its first and last occurrence. Negating this vector with ! gives a logical vector that is TRUE only for rows appearing exactly once, and it is used as the row index.
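To make the two scans concrete, here is a minimal sketch on a small toy data frame. The data and the variable names (dup_forward, dup_backward, in_dup_group, singles) are made up for illustration and are not part of the original answer.

```r
# Hypothetical toy data: rows 1 and 3 are identical, as are rows 2 and 5.
df <- data.frame(x = c(1, 2, 1, 3, 2),
                 y = c("a", "b", "a", "c", "b"))

# TRUE for rows that repeat an earlier row (scan from the top).
dup_forward <- duplicated(df)
# FALSE FALSE  TRUE FALSE  TRUE

# TRUE for rows that repeat a later row (scan from the bottom).
dup_backward <- duplicated(df, fromLast = TRUE)
#  TRUE  TRUE FALSE FALSE FALSE

# A row belongs to a duplicated group if either scan flags it.
in_dup_group <- dup_forward | dup_backward
#  TRUE  TRUE  TRUE FALSE  TRUE

# Keep only the rows that occur exactly once (here: row 4).
singles <- df[!in_dup_group, ]
print(singles)
```

Neither scan alone is enough: the forward scan misses the first copy of each repeated row, and the backward scan misses the last copy.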

Sven Hohenstein answered Oct 20 '22 08:10



A possibility involving dplyr could be:

library(dplyr)

df %>%
 group_by_all() %>%
 filter(n() == 1)

Or:

df %>%
 group_by_all() %>%
 filter(!any(row_number() > 1))

Since dplyr 1.0.0, the preferred way is:

df %>%
 group_by(across(everything())) %>%
 filter(n() == 1)
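As a rough end-to-end illustration of the grouped approach, the sketch below runs it on a toy data frame. It assumes dplyr 1.0.0 or later is installed; the data, the name singles, and the final ungroup() call are additions for illustration, not part of the original answer.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 are identical, as are rows 2 and 5.
df <- data.frame(x = c(1, 2, 1, 3, 2),
                 y = c("a", "b", "a", "c", "b"))

singles <- df %>%
  group_by(across(everything())) %>%  # one group per distinct row
  filter(n() == 1) %>%                # keep only groups of size one
  ungroup()                           # return an ungrouped tibble

print(singles)
```

Without ungroup(), the result stays grouped by every column, which can affect later summarise() or mutate() calls.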
tmfmnk answered Oct 20 '22 07:10



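Try it:

# Five distinct rows
DF1 <- data.frame(Part = c(1, 2, 3, 4, 5), Age = c(23, 34, 23, 25, 24), B.P = c(87, 76, 75, 75, 78))

# Exact copies of rows 3 and 5 of DF1
DF2 <- data.frame(Part = c(3, 5), Age = c(23, 24), B.P = c(75, 78))

# Stack them so that rows 3 and 5 each occur twice
DF3 <- rbind(DF1, DF2)

# Drop every row that occurs more than once; rows 1, 2 and 4 remain
DF3 <- DF3[!(duplicated(DF3) | duplicated(DF3, fromLast = TRUE)), ]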
Brutalroot answered Oct 20 '22 08:10
