Reduce memory footprint of data.table with highly repeated key

I am writing a package to analyse high-throughput animal behaviour data in R. The data are multivariate time series. I have chosen to represent them using data.tables, which I find very convenient.

For one animal, I would have something like this:

one_animal_dt <- data.table(t=1:20, x=rnorm(20), y=rnorm(20))

However, my users and I work with many animals that have different arbitrary treatments, conditions and other variables, all of which are constant within each animal.

In the end, the most convenient way I found to represent the data was to merge the behaviour of all the animals from all the experiments into a single data.table, and to use extra columns, which I set as the key, for each of these "repeated variables".

So, conceptually, something like this:

animal_list <- list()
animal_list[[1]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                               treatment="A", date="2017-02-21 20:00:00", 
                               animal_id=1)
animal_list[[2]]  <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                                treatment="B", date="2017-02-21 22:00:00",
                                animal_id=2)
# ...
final_dt <- rbindlist(animal_list)
setkeyv(final_dt,c("treatment", "date","animal_id"))

This layout makes it very convenient to compute summaries per animal whilst remaining agnostic about all the biological information (treatments and so on).
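For example, a per-animal summary is then a one-liner (the names mean_x and mean_y below are just illustrative):

final_dt[, .(mean_x = mean(x), mean_y = mean(y)), by = .(treatment, date, animal_id)]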

In practice, we have millions of (rather than 20) consecutive reads for each animal, so the columns we added for convenience contain highly repeated values, which is not memory efficient.

Is there a way to compress this highly redundant key without losing the structure (i.e. the columns) of the table? Ideally, I don't want to force my users to use JOINs themselves.

asked Jul 19 '17 by Quentin Geissmann



1 Answer

Let's assume we are database administrators given the task of implementing this efficiently in a SQL database. One of the goals of database normalisation is the reduction of redundancy.

According to the OP's description, there are many (about 1 M) observations per animal (multivariate, longitudinal data), while the number of animals seems to be much smaller.

So, the constant (or invariant) base data of each animal, e.g., treatment and date, should be kept separately from the observations.

animal_id is the key into both tables, assuming animal_id is unique (as the name suggests).

(Note that this is the main difference from Mallick's answer, which uses treatment as the key; treatment is not guaranteed to be unique, i.e., two animals may receive the same treatment, and using it as the key furthermore increases redundancy.)

Separate tables are memory efficient

For the purpose of demonstration, more realistic "benchmark" data are created for 10 animals with 1 M observations each:

library(data.table)   # CRAN version 1.10.4 used
# create observations
n_obs <- 1E6L
n_animals <- 10L
set.seed(123L)
observations <- data.table(
  animal_id = rep(seq_len(n_animals), each = n_obs),
  t = rep(seq_len(n_obs), n_animals),
  x = rnorm(n_animals * n_obs), 
  y = rnorm(n_animals * n_obs))
# create animal base data
animals <- data.table(
  animal_id = seq_len(n_animals),
  treatment = wakefield::string(n_animals),
  date = wakefield::date_stamp(n_animals, random = TRUE))

Here the wakefield package is used to create dummy names and dates. Note that animal_id is of type integer.

> str(observations)
Classes ‘data.table’ and 'data.frame':    10000000 obs. of  4 variables:
 $ animal_id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ t        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ x        : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
 $ y        : num  0.696 -0.537 -3.043 1.849 -1.085 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> str(animals)
Classes ‘data.table’ and 'data.frame':    10 obs. of  3 variables:
 $ animal_id: int  1 2 3 4 5 6 7 8 9 10
 $ treatment:Classes 'variable', 'character'  atomic [1:10] MADxZ9c6fN ymoJHnvrRx ifdtywJ4jU Q7ZRwnQCsU ...
  .. ..- attr(*, "varname")= chr "String"
 $ date     : variable, format: "2017-07-02" "2016-10-02" ...
 - attr(*, ".internal.selfref")=<externalptr>

The combined size is about 240 Mbytes:

> object.size(observations)
240001568 bytes
> object.size(animals)
3280 bytes

Let's take this as a reference and compare it with the OP's approach, final_dt:

# join both tables to create equivalent of final_dt
joined <- animals[observations, on = "animal_id"]

The size has now grown to 400 Mbytes, two thirds more than the two separate tables combined, which is not memory efficient: the repeated treatment and date columns alone occupy about 80 Mbytes each.

> object.size(joined)
400003432 bytes

Note that no data.table key has been set so far. Instead, the on parameter was used to specify the column to join on. If we set the key, joins will be sped up and the on parameter can be omitted:

setkey(observations, animal_id)
setkey(animals, animal_id)
joined <- animals[observations] 

How to work with separate tables?

So far, we have demonstrated that using two separate tables is memory efficient.

For subsequent analysis, we can aggregate the observations per animal, e.g.,

observations[, .(.N, mean(x), mean(y)), by = animal_id]
    animal_id       N            V2            V3
 1:         1 1000000 -5.214370e-04 -0.0019643145
 2:         2 1000000 -1.555513e-03  0.0002489457
 3:         3 1000000  1.541233e-06 -0.0005317967
 4:         4 1000000  1.775802e-04  0.0016212182
 5:         5 1000000 -9.026074e-04  0.0015266330
 6:         6 1000000 -1.000892e-03  0.0003284044
 7:         7 1000000  1.770055e-04 -0.0018654386
 8:         8 1000000  1.919562e-03  0.0008605261
 9:         9 1000000  1.175696e-03  0.0005042170
10:        10 1000000  1.681614e-03  0.0020562628

and join the aggregates with animals

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]]
    animal_id  treatment       date       N            V2            V3
 1:         1 MADxZ9c6fN 2017-07-02 1000000 -5.214370e-04 -0.0019643145
 2:         2 ymoJHnvrRx 2016-10-02 1000000 -1.555513e-03  0.0002489457
 3:         3 ifdtywJ4jU 2016-10-02 1000000  1.541233e-06 -0.0005317967
 4:         4 Q7ZRwnQCsU 2017-02-02 1000000  1.775802e-04  0.0016212182
 5:         5 H2M4V9Dfxz 2017-04-02 1000000 -9.026074e-04  0.0015266330
 6:         6 29P3hFxqNY 2017-03-02 1000000 -1.000892e-03  0.0003284044
 7:         7 rBxjewyGML 2017-02-02 1000000  1.770055e-04 -0.0018654386
 8:         8 gQP8cZhcTT 2017-04-02 1000000  1.919562e-03  0.0008605261
 9:         9 0GEOseSshh 2017-07-02 1000000  1.175696e-03  0.0005042170
10:        10 x74yDs2MdT 2017-02-02 1000000  1.681614e-03  0.0020562628

The OP has pointed out that he doesn't want to force his users to use joins themselves. Admittedly, typing animals[observations] takes more keystrokes than final_dt. So it's up to the OP to decide whether the memory savings are worth the extra keystrokes.
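
If even animals[observations] is too much to ask of the users, the join can be wrapped once in a small helper, for example (the name final_view is purely illustrative):

# hypothetical convenience wrapper: users call final_view() and never type the join
final_view <- function() animals[observations]

final_view()[, .(mean_x = mean(x)), by = animal_id]

Be aware that every call materialises the full join, so this trades the permanent memory savings of the two-table layout against a temporary allocation during each call.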

The joined aggregate above can be filtered further, for instance, if we want to compare animals with certain characteristics, e.g.,

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]][date == as.Date("2017-07-02")]
   animal_id  treatment       date       N           V2           V3
1:         1 MADxZ9c6fN 2017-07-02 1000000 -0.000521437 -0.001964315
2:         9 0GEOseSshh 2017-07-02 1000000  0.001175696  0.000504217

OP's use cases

In this comment, the OP has described some use cases which he wants to see implemented transparently for his users:

  • Creation of new columns final_dt[, x2 := 1-x]: As only observations are involved, this translates directly to observations[, x2 := 1-x].
  • Selecting rows using various criteria final_dt[t > 5 & treatment == "A"]: Here columns of both tables are involved. This can be implemented with data.table in different ways (note that the conditions have been amended to match the actual sample data):

    animals[observations][t < 5L & treatment %like% "MAD"]
    

    This is analogous to the expected syntax but is slower than the alternative below because the filter conditions are applied to all rows of the full join.

    The faster alternative is to split up the filter conditions: observations is filtered before the join to reduce the intermediate result, and the conditions on the base data columns are applied afterwards:

    animals[observations[t < 5L]][treatment %like% "MAD"]
    

    Note that this looks quite similar to the expected syntax (and is even one keystroke shorter).

    If this is deemed unacceptable by the users, the join operation can be hidden in a function:

    # function definition
    filter_dt <- function(ani_filter = "", obs_filter = "") {
      eval(parse(text = stringr::str_interp(
        'animals[observations[${obs_filter}]][${ani_filter}]')))
    }
    
    # called by user
    filter_dt("treatment %like% 'MAD'", "t < 5L")
    
       animal_id  treatment       date t           x          y
    1:         1 MADxZ9c6fN 2017-07-02 1 -0.56047565  0.6958622
    2:         1 MADxZ9c6fN 2017-07-02 2 -0.23017749 -0.5373377
    3:         1 MADxZ9c6fN 2017-07-02 3  1.55870831 -3.0425688
    4:         1 MADxZ9c6fN 2017-07-02 4  0.07050839  1.8488057
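
    Since eval(parse(...)) on user-supplied strings can be brittle, an alternative sketch (the name filter_dt2 is illustrative, not part of any package) captures unquoted expressions with substitute() and hands them to data.table via eval():

    # hypothetical variant: users pass plain data.table expressions instead of strings
    filter_dt2 <- function(obs_filter, ani_filter) {
      obs_e <- substitute(obs_filter)   # capture the unevaluated filter expressions
      ani_e <- substitute(ani_filter)
      animals[observations[eval(obs_e)]][eval(ani_e)]
    }

    # called by user
    filter_dt2(t < 5L, treatment %like% "MAD")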
    

Using factors to reduce memory footprint

Caveat: Your mileage may vary, as the conclusions below depend on the internal representation of integers on your computer and on the cardinality of the data. Please see Matt Dowle's excellent answer concerning this subject.

Mallick has mentioned that memory might be wasted if integers are accidentally stored as numerics. This can be demonstrated:

n <- 10000L
# integer vs numeric vs logical
test_obj_size <- data.table(
  rep(1, n),
  rep(1L, n),
  rep(TRUE, n))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  3 variables:
 $ V1: num  1 1 1 1 1 1 1 1 1 1 ...
 $ V2: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V3: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3 
80040 40040 40040

Note that the numeric vector needs twice as much memory as the integer vector. Therefore, it is good programming practice to always qualify an integer constant with the suffix character L.

The memory consumption of character strings can also be reduced if they are coerced to factor:

# character vs factor
test_obj_size <- data.table(
  rep("A", n),
  rep("AAAAAAAAAAA", n),
  rep_len(LETTERS, n),
  factor(rep("A", n)),
  factor(rep("AAAAAAAAAAA", n)),
  factor(rep_len(LETTERS, n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  6 variables:
 $ V1: chr  "A" "A" "A" "A" ...
 $ V2: chr  "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" ...
 $ V3: chr  "A" "B" "C" "D" ...
 $ V4: Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
 $ V5: Factor w/ 1 level "AAAAAAAAAAA": 1 1 1 1 1 1 1 1 1 1 ...
 $ V6: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4    V5    V6 
80088 80096 81288 40456 40464 41856

Stored as factors, only half of the memory is required.

The same holds for Date and POSIXct classes:

# Date & POSIXct vs factor
test_obj_size <- data.table(
  rep(as.Date(Sys.time()), n),
  rep(as.POSIXct(Sys.time()), n),
  factor(rep(as.Date(Sys.time()), n)),
  factor(rep(as.POSIXct(Sys.time()), n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  4 variables:
 $ V1: Date, format: "2017-08-02" "2017-08-02" "2017-08-02" "2017-08-02" ...
 $ V2: POSIXct, format: "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" ...
 $ V3: Factor w/ 1 level "2017-08-02": 1 1 1 1 1 1 1 1 1 1 ...
 $ V4: Factor w/ 1 level "2017-08-02 18:25:55": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4 
80248 80304 40464 40480

Note that data.table() refuses to create a column of class POSIXlt, as each value would be stored in 40 bytes instead of 8.

So, if your application is memory critical, it might be worthwhile to consider using factor where applicable.
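
Applied to the data above, a sketch (assuming the joined table from the first section is still in memory) converts the repeated columns in place:

# convert the highly repeated columns of the joined table to factor;
# each conversion shrinks the column from 8 bytes per row to roughly 4
joined[, c("treatment", "date") := .(factor(treatment), factor(date))]
object.size(joined)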

answered Oct 16 '22 by Uwe