Reduce memory footprint of data.table with highly repeated key

I am writing a package to analyse high-throughput animal behaviour data in R. The data are multivariate time series. I have chosen to represent them using data.tables, which I find very convenient.

For one animal, I would have something like this:

one_animal_dt <- data.table(t=1:20, x=rnorm(20), y=rnorm(20))

However, my users and I work with many animals that have different arbitrary treatments, conditions and other variables, all of which are constant within each animal.

In the end, the most convenient way I found to represent the data was to merge the behaviour of all the animals from all the experiments into a single data.table, and to use extra columns, which I set as the key, for each of these "repeated variables".

So, conceptually, something like this:

animal_list <- list()
animal_list[[1]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                               treatment="A", date="2017-02-21 20:00:00", 
                               animal_id=1)
animal_list[[2]]  <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                                treatment="B", date="2017-02-21 22:00:00",
                                animal_id=2)
# ...
final_dt <- rbindlist(animal_list)
setkeyv(final_dt,c("treatment", "date","animal_id"))

This layout makes it very convenient to compute summaries per animal whilst remaining agnostic about all the biological information (treatments and so on).
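For example, a per-animal summary is then a one-liner (the names mean_x and mean_y below are just illustrative):

final_dt[, .(mean_x = mean(x), mean_y = mean(y)), by = .(treatment, date, animal_id)]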

In practice, we have millions of (rather than 20) consecutive reads for each animal, so the columns we added for convenience contain highly repeated values, which is not memory efficient.

Is there a way to compress this highly redundant key without losing the structure (i.e. the columns) of the table? Ideally, I don't want to force my users to use JOINs themselves.

asked Jul 19 '17 by Quentin Geissmann



1 Answer

Let's assume we are database administrators given the task of implementing this efficiently in a SQL database. One of the goals of database normalisation is the reduction of redundancy.

According to the OP's description, there are many (about 1 M) observations per animal (multivariate, longitudinal data), while the number of animals seems to be much smaller.

So, the constant (or invariant) base data of each animal, e.g., treatment and date, should be kept separately from the observations.

animal_id is the key into both tables, assuming animal_id is unique (as the name suggests).

(Note that this is the main difference from Mallick's answer, which uses treatment as the key; treatment is not guaranteed to be unique, i.e., two animals may receive the same treatment, and using it as the key furthermore increases redundancy.)

Separate tables are memory efficient

For the purpose of demonstration, more realistic "benchmark" data are created for 10 animals with 1 M observations each:

library(data.table)   # CRAN version 1.10.4 used
# create observations
n_obs <- 1E6L
n_animals <- 10L
set.seed(123L)
observations <- data.table(
  animal_id = rep(seq_len(n_animals), each = n_obs),
  t = rep(seq_len(n_obs), n_animals),
  x = rnorm(n_animals * n_obs), 
  y = rnorm(n_animals * n_obs))
# create animal base data
animals <- data.table(
  animal_id = seq_len(n_animals),
  treatment = wakefield::string(n_animals),
  date = wakefield::date_stamp(n_animals, random = TRUE))

Here the wakefield package is used to create dummy names and dates. Note that animal_id is of type integer.

> str(observations)
Classes ‘data.table’ and 'data.frame':    10000000 obs. of  4 variables:
 $ animal_id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ t        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ x        : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
 $ y        : num  0.696 -0.537 -3.043 1.849 -1.085 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> str(animals)
Classes ‘data.table’ and 'data.frame':    10 obs. of  3 variables:
 $ animal_id: int  1 2 3 4 5 6 7 8 9 10
 $ treatment:Classes 'variable', 'character'  atomic [1:10] MADxZ9c6fN ymoJHnvrRx ifdtywJ4jU Q7ZRwnQCsU ...
  .. ..- attr(*, "varname")= chr "String"
 $ date     : variable, format: "2017-07-02" "2016-10-02" ...
 - attr(*, ".internal.selfref")=<externalptr>

The combined size is about 240 Mbytes:

> object.size(observations)
240001568 bytes
> object.size(animals)
3280 bytes

Let's take this as a reference and compare it with the OP's approach, final_dt:

# join both tables to create equivalent of final_dt
joined <- animals[observations, on = "animal_id"]

The size has now grown to 400 Mbytes, two thirds more than the two separate tables combined, which is not memory efficient: the repeated treatment and date columns alone occupy about 80 Mbytes each.

> object.size(joined)
400003432 bytes

Note that no data.table key has been set so far. Instead, the on parameter was used to specify the column to join on. If we set the key, joins will be sped up and the on parameter can be omitted:

setkey(observations, animal_id)
setkey(animals, animal_id)
joined <- animals[observations] 

How to work with separate tables?

So far, we have demonstrated that using two separate tables is memory efficient.

For subsequent analysis, we can aggregate the observations per animal, e.g.,

observations[, .(.N, mean(x), mean(y)), by = animal_id]
    animal_id       N            V2            V3
 1:         1 1000000 -5.214370e-04 -0.0019643145
 2:         2 1000000 -1.555513e-03  0.0002489457
 3:         3 1000000  1.541233e-06 -0.0005317967
 4:         4 1000000  1.775802e-04  0.0016212182
 5:         5 1000000 -9.026074e-04  0.0015266330
 6:         6 1000000 -1.000892e-03  0.0003284044
 7:         7 1000000  1.770055e-04 -0.0018654386
 8:         8 1000000  1.919562e-03  0.0008605261
 9:         9 1000000  1.175696e-03  0.0005042170
10:        10 1000000  1.681614e-03  0.0020562628

and join the aggregates with animals

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]]
    animal_id  treatment       date       N            V2            V3
 1:         1 MADxZ9c6fN 2017-07-02 1000000 -5.214370e-04 -0.0019643145
 2:         2 ymoJHnvrRx 2016-10-02 1000000 -1.555513e-03  0.0002489457
 3:         3 ifdtywJ4jU 2016-10-02 1000000  1.541233e-06 -0.0005317967
 4:         4 Q7ZRwnQCsU 2017-02-02 1000000  1.775802e-04  0.0016212182
 5:         5 H2M4V9Dfxz 2017-04-02 1000000 -9.026074e-04  0.0015266330
 6:         6 29P3hFxqNY 2017-03-02 1000000 -1.000892e-03  0.0003284044
 7:         7 rBxjewyGML 2017-02-02 1000000  1.770055e-04 -0.0018654386
 8:         8 gQP8cZhcTT 2017-04-02 1000000  1.919562e-03  0.0008605261
 9:         9 0GEOseSshh 2017-07-02 1000000  1.175696e-03  0.0005042170
10:        10 x74yDs2MdT 2017-02-02 1000000  1.681614e-03  0.0020562628

The OP has pointed out that he doesn't want to force his users to use joins themselves. Admittedly, typing animals[observations] takes more keystrokes than final_dt. So it's up to the OP to decide whether the memory savings are worth the extra keystrokes.
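
If even animals[observations] is too much to ask of the users, the join can be wrapped once in a small helper, for example (the name final_view is purely illustrative):

# hypothetical convenience wrapper: users call final_view() and never type the join
final_view <- function() animals[observations]

final_view()[, .(mean_x = mean(x)), by = animal_id]

Be aware that every call materialises the full join, so this trades the permanent memory savings of the two-table layout against a temporary allocation during each call.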

The joined aggregate above can be filtered further, for instance, if we want to compare animals with certain characteristics, e.g.,

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]][date == as.Date("2017-07-02")]
   animal_id  treatment       date       N           V2           V3
1:         1 MADxZ9c6fN 2017-07-02 1000000 -0.000521437 -0.001964315
2:         9 0GEOseSshh 2017-07-02 1000000  0.001175696  0.000504217

OP's use cases

In this comment, the OP has described some use cases which he wants to see implemented transparently for his users:

  • Creation of new columns final_dt[, x2 := 1-x]: As only observations are involved, this translates directly to observations[, x2 := 1-x].
  • Selecting rows using various criteria final_dt[t > 5 & treatment == "A"]: Here columns of both tables are involved. This can be implemented with data.table in different ways (note that the conditions have been amended to match the actual sample data):

    animals[observations][t < 5L & treatment %like% "MAD"]
    

    This is analogous to the expected syntax but is slower than the alternative below because the filter conditions are applied to all rows of the full join.

    The faster alternative is to split up the filter conditions: observations is filtered before the join to reduce the intermediate result, and the conditions on the base data columns are applied afterwards:

    animals[observations[t < 5L]][treatment %like% "MAD"]
    

    Note that this looks quite similar to the expected syntax (and is even one keystroke shorter).

    If this is deemed unacceptable by the users, the join operation can be hidden in a function:

    # function definition
    filter_dt <- function(ani_filter = "", obs_filter = "") {
      eval(parse(text = stringr::str_interp(
        'animals[observations[${obs_filter}]][${ani_filter}]')))
    }
    
    # called by user
    filter_dt("treatment %like% 'MAD'", "t < 5L")
    
       animal_id  treatment       date t           x          y
    1:         1 MADxZ9c6fN 2017-07-02 1 -0.56047565  0.6958622
    2:         1 MADxZ9c6fN 2017-07-02 2 -0.23017749 -0.5373377
    3:         1 MADxZ9c6fN 2017-07-02 3  1.55870831 -3.0425688
    4:         1 MADxZ9c6fN 2017-07-02 4  0.07050839  1.8488057
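
    Since eval(parse(...)) on user-supplied strings can be brittle, an alternative sketch (the name filter_dt2 is illustrative, not part of any package) captures unquoted expressions with substitute() and hands them to data.table via eval():

    # hypothetical variant: users pass plain data.table expressions instead of strings
    filter_dt2 <- function(obs_filter, ani_filter) {
      obs_e <- substitute(obs_filter)   # capture the unevaluated filter expressions
      ani_e <- substitute(ani_filter)
      animals[observations[eval(obs_e)]][eval(ani_e)]
    }

    # called by user
    filter_dt2(t < 5L, treatment %like% "MAD")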
    

Using factors to reduce memory footprint

Caveat: Your mileage may vary, as the conclusions below depend on the internal representation of integers on your computer and on the cardinality of the data. Please see Matt Dowle's excellent answer concerning this subject.

Mallick has mentioned that memory might be wasted if integers are accidentally stored as numerics. This can be demonstrated:

n <- 10000L
# integer vs numeric vs logical
test_obj_size <- data.table(
  rep(1, n),
  rep(1L, n),
  rep(TRUE, n))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  3 variables:
 $ V1: num  1 1 1 1 1 1 1 1 1 1 ...
 $ V2: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V3: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3 
80040 40040 40040

Note that the numeric vector needs twice as much memory as the integer vector. Therefore, it is good programming practice to always qualify an integer constant with the suffix character L.

The memory consumption of character strings can also be reduced if they are coerced to factor:

# character vs factor
test_obj_size <- data.table(
  rep("A", n),
  rep("AAAAAAAAAAA", n),
  rep_len(LETTERS, n),
  factor(rep("A", n)),
  factor(rep("AAAAAAAAAAA", n)),
  factor(rep_len(LETTERS, n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  6 variables:
 $ V1: chr  "A" "A" "A" "A" ...
 $ V2: chr  "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" ...
 $ V3: chr  "A" "B" "C" "D" ...
 $ V4: Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
 $ V5: Factor w/ 1 level "AAAAAAAAAAA": 1 1 1 1 1 1 1 1 1 1 ...
 $ V6: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4    V5    V6 
80088 80096 81288 40456 40464 41856

Stored as factors, only half of the memory is required.

The same holds for Date and POSIXct classes:

# Date & POSIXct vs factor
test_obj_size <- data.table(
  rep(as.Date(Sys.time()), n),
  rep(as.POSIXct(Sys.time()), n),
  factor(rep(as.Date(Sys.time()), n)),
  factor(rep(as.POSIXct(Sys.time()), n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  4 variables:
 $ V1: Date, format: "2017-08-02" "2017-08-02" "2017-08-02" "2017-08-02" ...
 $ V2: POSIXct, format: "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" ...
 $ V3: Factor w/ 1 level "2017-08-02": 1 1 1 1 1 1 1 1 1 1 ...
 $ V4: Factor w/ 1 level "2017-08-02 18:25:55": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4 
80248 80304 40464 40480

Note that data.table() refuses to create a column of class POSIXlt, as each value would be stored in 40 bytes instead of 8.

So, if your application is memory critical, it might be worthwhile to consider using factor where applicable.
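
Applied to the data above, a sketch (assuming the joined table from the first section is still in memory) converts the repeated columns in place:

# convert the highly repeated columns of the joined table to factor;
# each conversion shrinks the column from 8 bytes per row to roughly 4
joined[, c("treatment", "date") := .(factor(treatment), factor(date))]
object.size(joined)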

answered Oct 16 '22 by Uwe