R combining duplicate rows in a time series with different column types in a datatable

Tags:

This question is building up on another question R combining duplicate rows by ID with different column types in a dataframe. I have a datatable with a column time and some other columns of different types (factors and numerics). Here is an example:

dt <- data.table(time  = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
             abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
             farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
             gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
             goff  = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
             huft  = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
             mode  = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))

Now I want to combine the duplicate times in column time. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.

dtRes <- data.table(time  = c(1, 1, 1, 2, 3, 4, 4),
                abst  = c(1, 1, 1, 0, 0, 3, 3),
                farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                mode  = c(4, 4, 4, 2.5, NA, 3, 3))

I need some fast calculation for this, because I have about a million observations.

Some extra thoughts to this problem: farbe may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different farbe, so there are 2 identical times and all the rest stays the same but different values for farbe. This should be just very rare case, but would be a great addition.

Also: I have a lot more numeric and factor columns in my real data so I don't want to define every single column separately. In some data tables there are no factor columns. So the solution has to work even if there are no numeric (time is always there and numeric) or factor columns.

Thx in advance!

238

asked May 18 '20 18:05

Bolle

2 Answers

We can do a group by mean

library(data.table)
library(tidyr)
library(dplyr)
dt[, lapply(.SD, function(x) if(is.numeric(x)) mean(x, na.rm = TRUE)
     else toString(unique(x[!is.na(x)]))), .(time)] %>%
     separate_rows(farbe, goff)
# A tibble: 7 x 7
#   time  abst farbe  gier goff   huft     mode
#  <dbl> <dbl> <chr> <dbl> <chr>  <chr>   <dbl>
#1     1     1 keine   2.5 "haus" ""        4  
#2     1     1 keine   2.5 "maus" ""        4  
#3     1     1 keine   2.5 "toll" ""        4  
#4     2     0 keine   0   "maus" "wolle"   2.5
#5     3     0 keine   0   ""     "wolle" NaN  
#6     4     3 rot     3   "maus" "holz"    3  
#7     4     3 blau    3   "maus" "holz"    3

Or with cSplit

library(splitstackshape)
cSplit(dt[, lapply(.SD, function(x) if(is.numeric(x)) 
    mean(x, na.rm = TRUE) else toString(unique(x[!is.na(x)]))), .(time)], 
    c('farbe', 'goff'), sep= ',\\s*', 'long', fixed = FALSE)
#   time abst farbe gier goff  huft mode
#1:    1    1 keine  2.5 haus        4.0
#2:    1    1  <NA>  2.5 maus        4.0
#3:    1    1  <NA>  2.5 toll        4.0
#4:    2    0 keine  0.0 maus wolle  2.5
#5:    3    0 keine  0.0 <NA> wolle  NaN
#6:    4    3   rot  3.0 maus  holz  3.0
#7:    4    3  blau  3.0 <NA>  holz  3.0

answered Nov 09 '22 05:11

akrun

The expected result (for the given sample dataset) can also be achieved without a subsequent call to separate_rows() or cSplit():

library(data.table) # version 1.12.9
dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
            else unlist(na.omit(unique(x)))), by = time]

   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    1    1 keine  2.5 toll  <NA>  4.0
4:    2    0 keine  0.0 maus wolle  2.5
5:    3    0 keine  0.0 <NA> wolle  NaN
6:    4    3   rot  3.0 maus  holz  3.0
7:    4    3  blau  3.0 maus  holz  3.0

Please, note that this approach will work for an arbitrary mix of numeric and factor columns; no column names need to be stated explicitly.

However, I do believe the correct answer to the underlying problem is to return one row per time instead of a kind of partial aggregate (your mileage may vary, of course):

dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]

   time abst    farbe gier           goff  huft mode
1:    1    1    keine  2.5 haus,maus,toll        4.0
2:    2    0    keine  0.0           maus wolle  2.5
3:    3    0    keine  0.0                wolle  NaN
4:    4    3 rot,blau  3.0           maus  holz  3.0

Please, note that list() instead of toString() has been used to aggregate the factor columns. This has the benefit to avoid problems in case one of the factor levels includes a comma , by chance. Furthermore, it is easier to identify cases with non-unique factors per time in a large production dataset:

# compute aggregate as before
dtRes <- dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]
# find cases with non-unique factors per group
# note .SDcols = is.list is available with data.table version 1.12.9
tmp <- dtRes[, which(Reduce(sum, lapply(.SD, function(x) lengths(x) > 1L)) > 0), .SDcols = is.list, by = time]
tmp

   time V1
1:    1  1
2:    4  1

# show affected rows
dtRes[tmp, on = "time"]

   time abst    farbe gier           goff huft mode V1
1:    1    1    keine  2.5 haus,maus,toll         4  1
2:    4    3 rot,blau  3.0           maus holz    3  1

# show not affected rows
dtRes[!tmp, on = "time"]

   time abst farbe gier goff  huft mode
1:    2    0 keine    0 maus wolle  2.5
2:    3    0 keine    0      wolle  NaN

answered Nov 09 '22 07:11

Uwe

Related questions
                            
                                What exactly is the SEXP data type in R's C API and why is it used? [closed]
                            
                                Trouble installing tabulizer package
                            
                                Calculate average of last 3 non holiday weekdays
                            
                                Is it possible to use R Plotly library in R Script Visual of Power BI?
                            
                                Inherit Roxygen2 documentation for multiple arguments in R package
                            
                                Shinydashboard 'topbar'
                            
                                alpha and fill legends in ggplot2 boxplots?
                            
                                Pass a condition as a function parameter
                            
                                Make list content available in a function environment
                            
                                Equivalent of apply() by row in the tidyverse?
                            
                                Line break in ggplot annotate with LateX expression
                            
                                Plotly 3D filling under the line
                            
                                Trainable sklearn StandardScaler for R
                            
                                ggplot2: how to separate geom_polygon and geom_line in legend keys?
                            
                                Merging Polygons in Shape Files with Common Tag IDs: unionSpatialPolygons
                            
                                Adding figures and tables after bibliography in Rmarkdown
                            
                                Why am I referred to the manual page for "Arithmetic" when I type `?mgcv-faq` without `library(mgcv)`?
                            
                                What does a tilde (~) in front of a single variable mean (facet_wrap)?
                            
                                What is the origin of numeric dates starting with 6?
                            
                                gganimate plot not showing and saving bunch of .png's

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

R combining duplicate rows in a time series with different column types in a datatable

Tags:

r

datatable

aggregate

time-series

Bolle

People also ask

2 Answers

akrun

Uwe

Recent Activity

Donate For Us