I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete continuous and categorical variables.
My approach has been to generate a table for each column in the frame into a list, then use the duplicated()
function to find rows in the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe
containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
To find duplicate columns we need to iterate through all columns of a DataFrame and for each and every column it will search if any other column exists in DataFrame with the same contents already. If yes then that column name will be stored in the duplicate column set.
Method 2: Using equals() methods. This method Test whether two-column contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
You can use the duplicated() function to find duplicate values in a pandas DataFrame.
How about:
testframe[!duplicated(as.list(testframe))]
You can do with lapply
:
testframe[!duplicated(lapply(testframe, summary))]
summary
summarizes the distribution while ignoring the order.
Not 100% but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With