
Identifying duplicate columns in a dataframe

Tags: dataframe, r

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.

My approach has been to generate a table() for each column of the frame, collect those tables in a list, and then use the duplicated() function to find the entries in that list that are duplicates, as follows:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

tables = apply(testframe, 2, table)                 # one frequency table per column
dups = which(duplicated(tables))                    # positions of columns whose tables repeat an earlier one
testframe <- subset(testframe, select = -c(dups))   # drop those columns

This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary() (note that the following assumes an original testframe containing duplicates):

summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))

If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?

asked Mar 22 '12 by BenHealey




3 Answers

How about:

testframe[!duplicated(as.list(testframe))]
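A quick check of what this gives on the question's sample data (my sketch, not part of the original answer; dup_cols is just an illustrative name):

dup_cols <- duplicated(as.list(testframe))   # TRUE for every column identical to an earlier one
names(testframe)[dup_cols]                   # should report "height2" and "gender2"
testframe[!dup_cols]                         # keeps age, height and gender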
answered Oct 20 '22 by Mostafa Rezaei


You can do it with lapply():

testframe[!duplicated(lapply(testframe, summary))]

summary() summarizes the distribution of each column while ignoring the order of the values.
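For what it's worth, this also hints at why the apply()-based summary attempt in the question misbehaves: apply() first coerces the data frame to a character matrix before summarizing, and duplicated() on the matrix that apply() returns compares its rows rather than your columns, whereas lapply() keeps each column in its original type and returns a list that duplicated() compares element by element. A small sketch, assuming the original five-column testframe:

str(apply(testframe, 2, summary))   # a character matrix: every column was coerced, so summaries collapse to Length/Class/Mode
str(lapply(testframe, summary))     # a list with one proper summary per column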

Not 100% but I would use digest if the data is huge:

library(digest)
testframe[!duplicated(lapply(testframe, digest))]
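A small variant of the same idea (my sketch, not from the answer): keeping the per-column hashes in a named vector makes it easy to see which columns get dropped.

library(digest)
hashes <- vapply(testframe, digest, character(1))   # one hash per column
names(testframe)[duplicated(hashes)]                # names of the redundant columns
testframe[!duplicated(hashes)]                      # the de-duplicated frame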
answered Oct 20 '22 by kohske


A nice trick that you can use is to transpose your data frame and then check for duplicates.

duplicated(t(testframe))
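To actually drop the columns this way (a sketch based on the one-liner above), index with the negated result. One caveat worth noting: t() turns a mixed-type data frame into a character matrix, so on something like 50K rows by 215 columns this can be slower and more memory-hungry than the list-based approaches above.

testframe[, !duplicated(t(testframe))]   # duplicated rows of the transpose are duplicated columns of the frame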
answered Oct 20 '22 by hshihab