 

Fast concatenation of data.table columns into one string column


Given an arbitrary list of column names in a data.table, I want to concatenate the contents of those columns into a single string stored in a new column. The columns I need to concatenate are not always the same, so I need to generate the expression to do so on the fly.

I have a sneaking suspicion that the way I'm using the eval(parse(...)) call could be replaced with something a bit more elegant, but the method below is the fastest I've been able to get it so far.

With 10 million rows, this takes about 21.7 seconds on this sample data (base R paste0 takes slightly longer -- 23.6 seconds). My actual data has 18-20 columns being concatenated and up to 100 million rows, so the slowdown becomes a little more impractical.

Any ideas to get this sped up?


Current methods

library(data.table)
library(stringi)

RowCount <- 1e7
DT <- data.table(x = "foo",
                 y = "bar",
                 a = sample.int(9, RowCount, TRUE),
                 b = sample.int(9, RowCount, TRUE),
                 c = sample.int(9, RowCount, TRUE),
                 d = sample.int(9, RowCount, TRUE),
                 e = sample.int(9, RowCount, TRUE),
                 f = sample.int(9, RowCount, TRUE))

## Generate an expression to paste an arbitrary list of columns together
ConcatCols <- c("x","a","b","c","d","e","f","y")
PasteStatement <- stri_c('stri_c(', stri_c(ConcatCols, collapse = ","), ')')
print(PasteStatement)

gives

[1] "stri_c(x,a,b,c,d,e,f,y)" 

which is then used to concatenate the columns with the following expression:

DT[,State := eval(parse(text = PasteStatement))] 
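
For reference, an eval(parse())-free way to express the same operation -- shown here only as a sketch, not benchmarked -- is to hand the selected columns to do.call via .SDcols:

## Sketch only: same result without eval(parse()); timings not compared here
DT[, State := do.call(stri_c, .SD), .SDcols = ConcatCols]
## base R equivalent
DT[, State := do.call(paste0, .SD), .SDcols = ConcatCols]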

Sample of output:

     x   y a b c d e f        State
1: foo bar 4 8 3 6 9 2 foo483692bar
2: foo bar 8 4 8 7 8 4 foo848784bar
3: foo bar 2 6 2 4 3 5 foo262435bar
4: foo bar 2 4 2 4 9 9 foo242499bar
5: foo bar 5 9 8 7 2 7 foo598727bar

Profiling Results

Flame Graph Data


Update 1: fread, fwrite, and sed

Following @Gregor's suggestion, I tried using sed to do the concatenation on disk. Thanks to data.table's blazing fast fread and fwrite functions, I was able to write out the columns to disk, eliminate comma delimiters using sed, and then read the post-processed output back in, all in about 18.3 seconds -- not quite fast enough to make the switch, but an interesting tangent nonetheless!

ConcatCols <- c("x","a","b","c","d","e","f","y")
fwrite(DT[, ..ConcatCols], "/home/xxx/DT.csv")
system("sed 's/,//g' /home/xxx/DT.csv > /home/xxx/DT_Post.csv ")
Post <- fread("/home/xxx/DT_Post.csv")
DT[, State := Post[[1]]]

Breakdown of the 18.3 seconds overall (unable to use profvis since sed is invisible to the R profiler):

  • data.table::fwrite() - 0.5 seconds
  • sed - 14.8 seconds
  • data.table::fread() - 3.0 seconds
  • := - 0.0 seconds

If nothing else, this is a testament to the extensive work of the data.table authors on performance optimizations for disk IO. (I'm using the 1.10.5 development version, which adds multi-threading to fread; fwrite has been multi-threaded for some time.)

One caveat: if there is a workaround to write the file using fwrite and a blank separator as suggested by @Gregor in another comment below, then this method could plausibly be cut down to ~3.5 seconds!
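
For reference, the call such a workaround would enable looks roughly like the sketch below -- illustrative only, since released fwrite() rejects an empty separator, which is exactly the limitation in question:

## Illustrative sketch only: current fwrite() requires sep to be a single
## character, so this call errors as-is; the ~3.5 second estimate assumes a
## hypothetical blank-separator code path.
fwrite(DT[, ..ConcatCols], "/home/xxx/DT_Post.csv", sep = "")
DT[, State := fread("/home/xxx/DT_Post.csv")[[1]]]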

Update on this tangent: I forked data.table and commented out the line requiring a separator longer than length 0, but mysteriously got spaces as the separator instead. After causing a handful of segfaults trying to mess around with the C internals, I've put this one on ice for the time being. The ideal solution would not require writing to disk and would keep everything in memory.


Update 2: sprintf for Integer Specific Cases

A second update here: While I included strings in my original usage example, my actual use case exclusively concatenates integer values (which can always be assumed non-null based on upstream cleaning steps).

Since this use case is highly specific and differs from the original question, I won't directly compare timings to those previously posted. However, one takeaway is that while stringi nicely handles many character encodings and mixed vector types without needing them to be specified, and does a bunch of error handling out of the box, this does add some time (which is probably worth it for most cases).

By using base R's sprintf function and letting it know up front that all of the inputs will be integers, we can shave off about 30% of the run-time for 5 million rows with 18 integer columns to be concatenated (20.3 seconds instead of 28.9).

library(data.table)
library(stringi)

RowCount <- 5e6
DT <- data.table(x = "foo",
                 y = "bar",
                 a = sample.int(9, RowCount, TRUE),
                 b = sample.int(9, RowCount, TRUE),
                 c = sample.int(9, RowCount, TRUE),
                 d = sample.int(9, RowCount, TRUE),
                 e = sample.int(9, RowCount, TRUE),
                 f = sample.int(9, RowCount, TRUE))

## Generate an expression to paste an arbitrary list of columns together
ConcatCols <- list("a","b","c","d","e","f")
## Do it 3x as many times
ConcatCols <- c(ConcatCols, ConcatCols, ConcatCols)

## Using stringi::stri_c ---------------------------------------------------
stri_joinStatement <- stri_c('stri_join(', stri_c(ConcatCols, collapse = ","),
                             ', sep="", collapse=NULL, ignore_null=TRUE)')
DT[, State := eval(parse(text = stri_joinStatement))]

## Using sprintf -----------------------------------------------------------
sprintfStatement <- stri_c("sprintf('", stri_flatten(rep("%i", length(ConcatCols))),
                           "', ", stri_c(ConcatCols, collapse = ","), ")")
DT[, State_sprintf_i := eval(parse(text = sprintfStatement))]

The generated statements are as follows:

> cat(stri_joinStatement)
stri_join(a,b,c,d,e,f,a,b,c,d,e,f,a,b,c,d,e,f, sep="", collapse=NULL, ignore_null=TRUE)
> cat(sprintfStatement)
sprintf('%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i', a,b,c,d,e,f,a,b,c,d,e,f,a,b,c,d,e,f)



Update 3: R does not have to be slow.

Based on the answer by @Martin Modrák, I put together a one-trick-pony package built on some data.table internals, specialized for the "single digit integer" case: fastConcat. (Don't look for it on CRAN any time soon, but you can use it at your own risk by installing it from the GitHub repo, msummersgill/fastConcat.)

This could probably be improved much further by someone who understands C better, but for now, it's running the same case as in Update 2 in 2.5 seconds -- around 8x faster than sprintf() and 11.5x faster than the stringi::stri_c() method I was using originally.

To me, this highlights the huge opportunity for performance improvements on some of the simplest operations in R, like rudimentary string-vector concatenation, with better-tuned C. I guess people like @Matt Dowle have seen this for years -- if only he had the time to re-write all of R, not just the data.frame.
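
If you want to experiment with it, installation is the usual GitHub route -- note that the exported function names may change, so check the repo's README rather than relying on this sketch:

## Install the proof-of-concept package from GitHub (use at your own risk)
# install.packages("remotes")
remotes::install_github("msummersgill/fastConcat")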



Asked Jan 12 '18 by Matt Summersgill



2 Answers

C to the rescue!

Stealing some code from data.table, we can write a C function that works way faster (and could be parallelized to be even faster).

First, make sure you have a working C toolchain with:

library(inline)

fx <- inline::cfunction(signature(x = "integer", y = "numeric"), '
    return ScalarReal( INTEGER(x)[0] * REAL(y)[0] ) ;
')
fx(2L, 5) # Should return 10

Then this should work (assuming integer-only data, but the code can be extended to other types):

library(inline)
library(data.table)
library(stringi)

header <- "

//Taken from https://github.com/Rdatatable/data.table/blob/master/src/fwrite.c
static inline void reverse(char *upp, char *low)
{
  upp--;
  while (upp>low) {
    char tmp = *upp;
    *upp = *low;
    *low = tmp;
    upp--;
    low++;
  }
}

void writeInt32(int *col, size_t row, char **pch)
{
  char *ch = *pch;
  int x = col[row];
  if (x == INT_MIN) {
    *ch++ = 'N';
    *ch++ = 'A';
  } else {
    if (x<0) { *ch++ = '-'; x=-x; }
    // Avoid log() for speed. Write backwards then reverse when we know how long.
    char *low = ch;
    do { *ch++ = '0'+x%10; x/=10; } while (x>0);
    reverse(ch, low);
  }
  *pch = ch;
}

//end of copied code

"

worker_fun <- inline::cfunction(
  signature(x = "list", preallocated_target = "character", columns = "integer",
            start_row = "integer", end_row = "integer"),
  includes = header, "
  const size_t _start_row = INTEGER(start_row)[0] - 1;
  const size_t _end_row = INTEGER(end_row)[0];

  const int max_out_len = 256 * 256; //max length of the final string
  char buffer[max_out_len];

  const size_t num_elements = _end_row - _start_row;
  const size_t num_columns = LENGTH(columns);
  const int * _columns = INTEGER(columns);

  for(size_t i = _start_row; i < _end_row; ++i) {
    char *buf_pos = buffer;
    for(size_t c = 0; c < num_columns; ++c) {
      if(c > 0) {
        buf_pos[0] = ',';
        ++buf_pos;
      }
      writeInt32(INTEGER(VECTOR_ELT(x, _columns[c] - 1)), i, &buf_pos);
    }
    SET_STRING_ELT(preallocated_target, i, mkCharLen(buffer, buf_pos - buffer));
  }
  return preallocated_target;
")

#Test with the same data

RowCount <- 5e6
DT <- data.table(x = "foo",
                 y = "bar",
                 a = sample.int(9, RowCount, TRUE),
                 b = sample.int(9, RowCount, TRUE),
                 c = sample.int(9, RowCount, TRUE),
                 d = sample.int(9, RowCount, TRUE),
                 e = sample.int(9, RowCount, TRUE),
                 f = sample.int(9, RowCount, TRUE))

## Generate an expression to paste an arbitrary list of columns together
ConcatCols <- list("a","b","c","d","e","f")
## Do it 3x as many times
ConcatCols <- c(ConcatCols, ConcatCols, ConcatCols)

ptm <- proc.time()
preallocated_target <- character(RowCount)
column_indices <- sapply(ConcatCols, FUN = function(x) { which(colnames(DT) == x) })
x <- worker_fun(DT, preallocated_target, column_indices, as.integer(1), as.integer(RowCount))
DT[, State := preallocated_target]
proc.time() - ptm

While your (integer only) example runs in about 20s on my PC, this runs in ~5s and can be easily parallelized.

Some things to note:

  • The code is not production ready - a lot of sanity checks should be made on the function inputs (especially checking if all columns are the same length, checking column types, preallocated_target size etc.)
  • The function puts its output into a preallocated character vector; this is non-standard and ugly (R usually does not have pass-by-reference semantics), but it allows for parallelization (see below).
  • The last two parameters are the start and end rows to be processed; once again, this is for parallelization.
  • The function accepts column indices, not column names. All columns have to be of type integer.
  • Except for the input data.table and preallocated_target, the inputs have to be integers.
  • Compilation time for the function is not included (as you should compile it beforehand - maybe even make a package)

Parallelization

EDIT: The approach below would actually fail due to the way clusterExport and R string storage work. Parallelization thus probably needs to be done in C as well, similarly to the way it is achieved in data.table.

Since you cannot pass inline-compiled functions across R processes, parallelization requires some more work. To be able to use the above function in parallel, you either need to compile it separately and load it with dyn.load, OR wrap it in a package, OR use a forking backend for parallel (I don't have one; forking works only on UNIX).
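
A rough sketch of the dyn.load route is below -- file and symbol names are hypothetical and it is not tested; the C code above would need to be wrapped in a .Call-compatible entry point first:

## Hypothetical standalone build (names are illustrative, not tested):
## 1. Save the C code, wrapped in a .Call-compatible function (say, concat_worker),
##    to concat.c and compile it once from the shell:
##      R CMD SHLIB concat.c
## 2. Load the shared object and define a thin R wrapper on each worker:
dyn.load("concat.so")   # "concat.dll" on Windows
worker_fun <- function(x, target, columns, start_row, end_row) {
  .Call("concat_worker", x, target, columns, start_row, end_row)
}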

Running in parallel would then look something like (not tested):

no_cores <- detectCores()

# Initiate cluster
cl <- makeCluster(no_cores)

#Preallocated target and prepare params
num_elements <- length(DT[[1]])
preallocated_target <- character(num_elements)
block_size <- 4096 #No of rows processed at once. Adjust for best performance
column_indices <- sapply(ConcatCols, FUN = function(x) { which(colnames(DT) == x) })

num_blocks <- ceiling(num_elements / block_size)

clusterExport(cl,
   c("DT","preallocated_target","column_indices","num_elements","block_size"))
clusterEvalQ(cl, <CODE TO LOAD THE NATIVE FUNCTION HERE>)

parLapply(cl, 1:num_blocks,
          function(block_id)
          {
            throw_away <-
              worker_fun(DT, preallocated_target, columns,
              (block_id - 1) * block_size + 1, min(num_elements, block_id * block_size - 1))
            return(NULL)
          })

stopCluster(cl)
Answered Oct 02 '22 by Martin Modrák


I don't know how representative the sample data is of your actual data, but in the case of your sampled data you can achieve a substantial performance improvement by concatenating each unique combination of ConcatCols only once instead of multiple times.

That means for the sample data, you'd be looking at ~500k concatenations vs 10 million if you do all the duplicates too.

See the following code and timing example:

system.time({
  setkeyv(DT, ConcatCols)
  DTunique <- unique(DT[, ConcatCols, with=FALSE], by = key(DT))
  DTunique[, State :=  do.call(paste, c(DTunique, sep = ""))]
  DT[DTunique, State := i.State, on = ConcatCols]
})
#       user      system     elapsed 
#      7.448       0.462       4.618 

About half the time is spent on the setkey part. If your data is already keyed, the time is cut down further, to just a bit more than 2 seconds.

setkeyv(DT, ConcatCols)
system.time({
  DTunique <- unique(DT[, ConcatCols, with=FALSE], by = key(DT))
  DTunique[, State :=  do.call(paste, c(DTunique, sep = ""))]
  DT[DTunique, State := i.State, on = ConcatCols]
})
#       user      system     elapsed 
#      2.526       0.280       2.181 
Answered Oct 02 '22 by talat