Compare two rows of a data.table and show only columns with differences [duplicate]

Tags: r, compare, data.table

I have a large data.table with columns of different types (e.g. numeric or character). For example:

 data.table(name=c("A","A"),val1=c(1,2),val2=c(3,3),cat=c("u","v"))

       name val1 val2 cat
   1:    A    1    3   u
   2:    A    2    3   v

As a result, I would like a data.table containing only the columns whose entries differ between the two rows:

 data.table(val1=c(1,2),cat=c("u","v"))

       val1 cat
   1:    1   u
   2:    2   v

asked Jul 03 '19 06:07

Strickland

2 Answers

With base R you could do:

library(data.table)

dt <- data.table(name=c("A","A"),val1=c(1,2),val2=c(3,3),cat=c("u","v"))

Filter(function(x) length(unique(x)) > 1, dt)   
#>    val1 cat
#> 1:    1   u
#> 2:    2   v
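
To see why this works: a data.table is a list of columns, so Filter applies the predicate to each column in turn and keeps the ones where it returns TRUE. A minimal sketch of the intermediate flags (the name `differs` is only for illustration):

```r
library(data.table)

dt <- data.table(name = c("A", "A"), val1 = c(1, 2), val2 = c(3, 3), cat = c("u", "v"))

# One logical flag per column: TRUE when the column holds more than one distinct value
differs <- vapply(dt, function(x) length(unique(x)) > 1, logical(1))

# Names of the columns that Filter() would keep
names(differs)[differs]
```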

answered Nov 28 '22 03:11

Joris C.

You can check how many distinct values each column has and keep only the columns with more than one:

mydt <- data.table(name=c("A", "A"), val1=c(1, 2), val2=c(3, 3), cat=c("u", "v"))
mydt_red <- mydt[, lapply(.SD, function(x) if(length(unique(x))!=1) x else NULL)]
mydt_red
#   val1 cat
#1:    1   u
#2:    2   v
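
A variant of the same idea, sketched with data.table's own uniqueN() and .SDcols instead of returning NULL from lapply. The variable name `keep` is made up for the example, and note that uniqueN() counts NA as a value by default, which may or may not be what you want:

```r
library(data.table)

mydt <- data.table(name = c("A", "A"), val1 = c(1, 2), val2 = c(3, 3), cat = c("u", "v"))

# Names of the columns with more than one distinct value
keep <- names(mydt)[vapply(mydt, function(x) uniqueN(x) > 1L, logical(1))]

# Select just those columns
mydt_red <- mydt[, .SD, .SDcols = keep]
mydt_red
```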

EDIT
As mentioned by @kath, a more efficient way to get the result is to compare each column's min and max and to combine that test with Filter:

mydt_red2 <- Filter(function(x) min(x)!=max(x), mydt)
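
One caveat with the min/max test, as far as I can tell: min() and max() return NA when a column contains NA (so Filter silently drops that column), and they raise an error on unordered factors. Below is a hedged sketch of a more defensive predicate. The name `differs` and the rule "a mix of NA and non-NA counts as a difference" are assumptions for this example, not part of the answer above:

```r
library(data.table)

# Hypothetical helper: TRUE when a column holds more than one distinct value
differs <- function(x) {
  if (is.factor(x)) x <- as.character(x)   # min()/max() fail on unordered factors
  if (anyNA(x)) return(!all(is.na(x)))     # assumption: NA next to a real value is a difference
  min(x) != max(x)
}

dt2 <- data.table(
  a = c(1, NA),                 # NA and a value
  b = factor(c("x", "x")),      # constant factor
  c = c("u", "v"),              # differing characters
  d = c(NA_real_, NA_real_)     # all NA
)

res <- Filter(differs, dt2)
names(res)
```

If NA should instead be ignored, the anyNA() branch could be replaced by dropping the NA values before comparing.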

Some basic benchmarking

# Data (inspired by https://stackoverflow.com/a/35746513/680068)
# 10000 x 10000 table of distinct integers, so every column varies
nrow <- 10000
ncol <- 10000
mydt <- data.frame(matrix(sample(1:(ncol*nrow), ncol*nrow, replace = FALSE), ncol = ncol))
setDT(mydt)

system.time(mydt_redUni <- mydt[, lapply(.SD, function(x) if(length(unique(x))>1) x else NULL)])
#   user  system elapsed 
#   2.31    0.52    2.83 
system.time(mydt_redFilt <- Filter(function(x) length(unique(x)) > 1, mydt))
#   user  system elapsed 
#   1.65    0.22    1.87 
system.time(mydt_redSort <- mydt[, lapply(.SD, function(x) {xs <- sort(x); if(xs[1]!=tail(xs, 1)) x else NULL})])
#   user  system elapsed 
#   3.87    0.00    3.87 
system.time(mydt_redMinMax <- mydt[, lapply(.SD, function(x) if(min(x)!=max(x)) x else NULL)])
#   user  system elapsed 
#   0.67    0.00    0.67 
system.time(mydt_redFiltminmax <- Filter(function(x) min(x)!=max(x), mydt))
#   user  system elapsed 
#   0.13    0.01    0.14 
system.time(mydt_redSotos <- Filter(function(i)var(as.numeric(as.factor(i))) != 0, mydt))
#   user  system elapsed 
# 100.76    0.05  100.84 
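
Timings only mean something if the variants select the same columns. A small sanity check on a toy table might look like the sketch below (the table `small` and its column names are invented for the example):

```r
library(data.table)

small <- data.table(
  const_num = rep(5, 100),
  vary_num  = 1:100,
  const_chr = rep("a", 100),
  vary_chr  = rep(c("u", "v"), 50)
)

# Three of the approaches timed above, applied to the toy table
by_unique <- Filter(function(x) length(unique(x)) > 1, small)
by_minmax <- Filter(function(x) min(x) != max(x), small)
by_sd     <- small[, lapply(.SD, function(x) if (length(unique(x)) > 1) x else NULL)]

# TRUE when all three keep the same columns in the same order
same_cols <- identical(names(by_unique), names(by_minmax)) &&
  identical(names(by_unique), names(by_sd))
same_cols
```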

answered Nov 28 '22 05:11

4 revs
