Identifying duplicate columns in a dataframe

Tags: dataframe, r

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.

My approach has been to generate a table for each column in the frame, collect the tables in a list, and then use the duplicated() function to find elements of the list that are duplicates, as follows:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))

This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):

summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))

If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?

asked Mar 22 '12 06:03

BenHealey

3 Answers

How about:

testframe[!duplicated(as.list(testframe))]
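
As a rough illustration of what this one-liner does, here it is applied step by step to the example frame from the question. The names `is_dup` and `deduped` are only illustrative and do not appear in the original answer.

```r
# Rebuild the example frame from the question
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# as.list() turns the frame into a list of columns; duplicated() then flags
# every column whose contents match an earlier column
is_dup <- duplicated(as.list(testframe))

# Keep only the first occurrence of each distinct column
deduped <- testframe[!is_dup]
names(deduped)  # "age" "height" "gender"
```

Because the comparison is on full column contents, the column names play no part: `height2` is dropped because its values match `height`, not because of its name.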

answered Oct 20 '22 12:10

Mostafa Rezaei

You can do this with lapply:

testframe[!duplicated(lapply(testframe, summary))]

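Note that summary describes only the distribution of the values and ignores their order, so two different columns that happen to have the same summary would be treated as duplicates.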
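
A small sketch of that caveat, using a made-up two-column frame (`df`, `x`, and `y` are hypothetical and not part of the question's data): the columns hold the same values in reverse order, so their summaries coincide even though the columns differ.

```r
# Two hypothetical columns with the same values in a different order
df <- data.frame(x = 1:5, y = 5:1)

# summary() depends only on the distribution, so both summaries match
# and the second column is flagged as a duplicate
dup_by_summary <- duplicated(lapply(df, summary))

# Comparing the full column contents keeps the two columns apart
dup_by_value <- duplicated(as.list(df))

dup_by_summary  # FALSE  TRUE
dup_by_value    # FALSE FALSE
```

If exact column equality is what matters, comparing the columns themselves is therefore the safer check; the summary-based test is better read as "same distribution".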

I'm not 100% sure, but I would use digest if the data is huge:

library(digest)
testframe[!duplicated(lapply(testframe, digest))]
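
As for why the apply-based summary attempt in the question seems to remove only one column, the following is a sketch of what I believe is going on, using a shortened version of the question's frame. apply() converts the data frame to a matrix first, and with mixed column types that matrix is character. summary() of a character vector then has the same length for every column (typically Length, Class, and Mode), so apply() simplifies the result to a matrix instead of a list.

```r
# Shortened version of the question's frame (3 rows instead of 12)
testframe <- data.frame(age = 18:20, height = c(76.1, 77, 78.1),
                        height2 = c(76.1, 77, 78.1),
                        gender = c("M", "F", "M"), gender2 = c("M", "F", "M"))

# apply() coerces the frame to a character matrix, so every column's
# summary has the same length and the results are bound into a matrix
summaries <- apply(testframe, 2, summary)
is.matrix(summaries)  # TRUE

# On a matrix, duplicated() compares rows, so these indices refer to rows of
# the summary matrix and line up with columns of testframe only by accident
which(duplicated(summaries))

# lapply() works on the original columns and always returns a list,
# which is what duplicated() needs to compare column by column
col_summaries <- lapply(testframe, summary)
is.list(col_summaries)  # TRUE
```

The table-based version in the question presumably works because the per-column tables have different lengths, so apply() cannot simplify them and returns a list.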

answered Oct 20 '22 13:10

kohske

A nice trick that you can use is to transpose your data frame and then check for duplicates.

duplicated(t(testframe))
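
The call above only returns a logical vector, one entry per column. A minimal sketch of using it to drop the duplicates, on a shortened version of the question's frame (`is_dup` and `deduped` are illustrative names):

```r
# Shortened version of the question's frame (3 rows instead of 12)
testframe <- data.frame(age = 18:20, height = c(76.1, 77, 78.1),
                        height2 = c(76.1, 77, 78.1),
                        gender = c("M", "F", "M"), gender2 = c("M", "F", "M"))

# t() converts the frame to a matrix whose rows are the original columns;
# duplicated() on a matrix flags rows that repeat an earlier row
is_dup <- duplicated(t(testframe))

# Negate the flags to keep the first copy of each column
deduped <- testframe[, !is_dup, drop = FALSE]
names(deduped)  # "age" "height" "gender"
```

One thing to keep in mind: with mixed column types, t() produces a character matrix, so values are compared as text rather than in their original types, and the conversion copies the whole frame, which may be noticeable at 50K rows by 215 columns.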

answered Oct 20 '22 12:10

hshihab
