 

Why does data.table unnesting time grow with the square of the number of rows for a specific dataset?

Tags: r, data.table

My goal is to unnest a column within a data.table. The original data.table has more than 800k rows; the following link has a sample of 5k rows.
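For context, here is a minimal sketch of the shape of the data, inferred from the code below (the item_id and lance names come from the question; the nested column names and all values are invented, and the real dataset has 11 columns):

library(data.table)

# Hypothetical miniature of the real data: an id column plus a list
# column 'lance' whose elements are themselves data.tables
# (bid/bidder and all values are made up for illustration).
item <- data.table(
  item_id = 1:3,
  lance   = list(
    data.table(bid = c(10, 12), bidder = c("a", "b")),
    data.table(bid = 15,        bidder = "c"),
    data.table(bid = c(9, 11),  bidder = c("a", "d"))
  )
)

# Unnesting: expand each nested data.table into rows, one group per item_id
item[, lance[[1]], by = item_id]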

I noticed, however, that the time required to unnest this dataset grows with the square of the number of rows, rather than roughly linearly as I expected:

# Subset for 500 rows    
> item_res <- item[1:500] 
> microbenchmark(item_int <- item_res[, lance[[1]], by = item_id], times = 5L)
Unit: milliseconds
                                                  expr      min       lq     mean   median       uq     max neval
 item_int <- item_res[, lance[[1]], by = item_id] 281.3878 282.2426 286.9925 284.4111 286.1291 300.792     5

# Subset for 5000 rows
> item_res <- item[1:5000] 
> microbenchmark(item_int <- item_res[, lance[[1]], by = item_id], times = 5L)
Unit: seconds
                                                  expr      min       lq     mean   median      uq     max neval
 item_int <- item_res[, lance[[1]], by = item_id] 44.35222 47.21508 47.40021 47.38034 47.9733 50.0801     5

Subsetting to 50,000 rows takes forever...

I expected the processing time to scale linearly with the number of rows, not quadratically. What may be going on?
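For anyone who wants to quantify the scaling directly, here is a sketch (using the item object from the question) that times the unnest at several subset sizes and estimates the exponent from a log-log regression:

# Estimate the scaling exponent: a slope near 1 means linear
# scaling, a slope near 2 means quadratic.
sizes <- c(250, 500, 1000, 2000, 4000)
times <- sapply(sizes, function(n) {
  sub <- item[1:n]
  system.time(sub[, lance[[1]], by = item_id])[["elapsed"]]
})
coef(lm(log(times) ~ log(sizes)))[["log(sizes)"]]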

PS: 1 - I checked that for the following mock data (matching the memory footprint of the original 800k-row dataset) the unnesting time is roughly linear, as expected:

library(data.table)
library(microbenchmark)

set.seed(1234)
n <- 5E4
n_nested <- 40

# Three small data.tables to serve as elements of the list column
v1 <- data.table(val = as.numeric(1:n_nested),         ll = letters[sample(1:20, size = n_nested, replace = TRUE)])
v2 <- data.table(val = as.numeric(1:n_nested * 2),     ll = letters[sample(1:20, size = n_nested, replace = TRUE)])
v3 <- data.table(val = as.numeric(1:n_nested * 2 + 1), ll = letters[sample(1:20, size = n_nested, replace = TRUE)])
char_1 <- as.character(1:n)
char_2 <- as.character(sample(1:n, n))
out <- data.table(char_1 = char_1, char_2 = char_2, value = list(v1, v2, v3))

microbenchmark(out[, value[[1]], by = .(char_1, char_2)]  , times = 5L)

For number of rows = 5E4:

Unit: milliseconds
                                      expr      min       lq     mean   median       uq      max neval
 out[, value[[1]], by = .(char_1, char_2)] 154.6323 169.8117 181.0281 183.1168 187.1461 210.4336     5

For number of rows = 5E5:

Unit: seconds
                                      expr      min       lq     mean   median       uq      max neval
 out[, value[[1]], by = .(char_1, char_2)] 2.137035 2.152496 2.359902 2.178358 2.324148 3.007475     5

For number of rows = 5E6:

Unit: seconds
                                      expr      min       lq     mean   median       uq      max neval
 out[, value[[1]], by = .(char_1, char_2)] 38.49398 40.88233 47.28661 41.20114 44.95406 70.90152     5

2 - I am using an Intel i7 with 16 GB of RAM, with R, RStudio, and the data.table package all up to date (RStudio 1.3.1056, R 4.0.2, data.table 1.13.0). The computer never pages memory to disk during the process.

3 - I also tried other unnesting implementations (the one chosen above was the fastest):

item_res[, lance[[1]], by = unnest_names]                        # Chosen one
item_res[, unlist(lance, recursive = FALSE), by = unnest_names]  # A little slower than the above
item_res[, rbindlist(lance), by = unnest_names]                  # Much slower than the above
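One more option worth noting: a grouping-free sketch (assuming every non-empty element of lance shares the same columns, and reusing the item_res and unnest_names objects from above) that binds all nested tables at once and repeats the grouping columns by each nested table's row count, sidestepping the per-group eval(j) entirely:

# Grouping-free alternative (a sketch; assumes all non-empty elements
# of 'lance' share the same columns). NROW() of a NULL or empty element
# is 0, so those rows drop out, matching rbindlist(), which skips NULLs.
lens     <- sapply(item_res$lance, NROW)
item_int <- cbind(
  item_res[rep(seq_len(.N), lens), ..unnest_names],
  rbindlist(item_res$lance)
)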

4 - As per Chirico's request, the verbose output and sessionInfo() for each version:

data.table 1.12.8:

> item_int <- item[, unlist(lance, recursive = F ), by = unnest_names, verbose = TRUE ] 
Detected that j uses these columns: lance 
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.150s elapsed (0.170s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as 'unlist(lance, recursive = F)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).

  memcpy contiguous groups took 0.048s for 872581 groups
  eval(j) took 1.560s for 872581 calls
14.3s elapsed (11.1s cpu) 
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-7 data.table_1.12.8    lubridate_1.7.9      stringi_1.4.6        runner_0.3.7         e1071_1.7-3         
[7] ggplot2_3.3.2        stringr_1.4.0        magrittr_1.5        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       pillar_1.4.6     compiler_4.0.2   class_7.3-17     tools_4.0.2      digest_0.6.25    packrat_0.5.0    evaluate_0.14   
 [9] lifecycle_0.2.0  tibble_3.0.3     gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.7      rstudioapi_0.11  yaml_2.2.1       xfun_0.16       
[17] withr_2.2.0      dplyr_1.0.0      knitr_1.29       generics_0.0.2   vctrs_0.3.2      grid_4.0.2       tidyselect_1.1.0 glue_1.4.1      
[25] R6_2.4.1         rmarkdown_2.3    purrr_0.3.4      scales_1.1.1     ellipsis_0.3.1   htmltools_0.5.0  colorspace_1.4-1 tinytex_0.25    
[33] munsell_0.5.0    crayon_1.3.4  

Processing ends in 10 seconds.

data.table 1.13.0:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lubridate_1.7.9   stringi_1.4.6     runner_0.3.7      e1071_1.7-3       ggplot2_3.3.2     stringr_1.4.0     magrittr_1.5     
[8] data.table_1.13.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       pillar_1.4.6     compiler_4.0.2   class_7.3-17     tools_4.0.2      digest_0.6.25    packrat_0.5.0    evaluate_0.14   
 [9] lifecycle_0.2.0  tibble_3.0.3     gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.7      rstudioapi_0.11  yaml_2.2.1       xfun_0.16       
[17] withr_2.2.0      dplyr_1.0.0      knitr_1.29       generics_0.0.2   vctrs_0.3.2      grid_4.0.2       tidyselect_1.1.0 glue_1.4.1      
[25] R6_2.4.1         rmarkdown_2.3    purrr_0.3.4      scales_1.1.1     ellipsis_0.3.1   htmltools_0.5.0  colorspace_1.4-1 tinytex_0.25    
[33] munsell_0.5.0    crayon_1.3.4    
> item_int <- item[, unlist(lance, recursive = F ), by = unnest_names, verbose = TRUE ] 
Detected that j uses these columns: lance 
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.160s elapsed (0.250s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.020s elapsed (0.010s cpu) 
lapply optimization is on, j unchanged as 'unlist(lance, recursive = F)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

From here processing takes forever.
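That verbose message points at the per-group names handling as a possible culprit. A hedged workaround to test (my assumption, not a confirmed fix) is to strip the names inside j so they are not rebuilt for each of the ~870k groups, at the cost of getting generic V1, V2, ... column names back:

# Drop names inside j so data.table does not recreate them per group
# (an assumption based on the verbose message, not a confirmed fix)
item_int <- item[, unname(unlist(lance, recursive = FALSE)),
                 by = unnest_names, verbose = TRUE]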

> table(lengths(item$lance))

     0      8 
 75171 797410 
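Note that lengths() on a list of data.tables counts each element's columns, not its rows (the length of a data.frame is its number of columns), so the 8 above is a column count. To tally rows per nested table instead:

# NROW counts rows of each nested data.table (0 for NULL/empty elements)
table(sapply(item$lance, NROW))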
asked Nov 06 '22 by Fabio Correa
1 Answer

SOLVED!! I downgraded data.table from 1.13.0 to 1.12.8, and it took only 4 seconds to process my entire 800k-row dataset.
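For anyone needing to pin the older version, one way is to install it from the CRAN archive (a sketch; assumes the remotes package is available):

# Install a specific archived version of data.table from CRAN
install.packages("remotes")
remotes::install_version("data.table", version = "1.12.8")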

answered Nov 15 '22 by Fabio Correa