Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast and efficient character DataFrame creation in Rcpp

Tags:

r

rcpp

I have written a parser that reads in character values into a std::vector<std::string> vector, it parses 1 million records in sub seconds. Then I want to convert the vectors to a Rcpp::DataFrame, which takes more than than 80 seconds...

Is there a way to create an Rcpp::DataFrame from large character vectors "efficiently"?

When using numeric values, I would try to std::memcpy() the std::vector to Rcpp::NumericVector (see this int64 example or this data.table example for more information), but this does not seem to work with character vectors due to their varying sizes.

More Information

The basic idea of the function is to parse Sudoku string data (each Sudoku string is exactly 81 characters long), there are two Sudokus in each line (the data is saved as a .csv file, you can find the data here).

$ head sudoku.csv
quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425
100005007380900000600000480820001075040760020069002001005039004000020100000046352,194685237382974516657213489823491675541768923769352841215839764436527198978146352
009065430007000800600108020003090002501403960804000100030509007056080000070240090,289765431317924856645138729763891542521473968894652173432519687956387214178246395
000000657702400100350006000500020009210300500047109008008760090900502030030018206,894231657762495183351876942583624719219387564647159328128763495976542831435918276
503070190000006750047190600400038000950200300000010072000804001300001860086720005,563472198219386754847195623472638519951247386638519472795864231324951867186723945 

Inside of the cpp read function, I fread() the file, fill a buffer (buffer) and parse the data to said std::vector<std::string> vectors (a and b in this example)

Note that the full code including the experiments I have done so far can be found in this gist.

const int BUFFERSIZE = 1e8;
const int n_lines = count_lines(filename); // 1 million in this case

FILE* infile;
infile = fopen(filename.c_str(), "r");
unsigned char * buffer;
buffer = (unsigned char*) malloc(BUFFERSIZE);
int64_t this_buffer_size;
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);

// removing of header not shown here...
// BUFFERSIZE is also checked so that no overflow occurs... not shown here..

int line = 0;
while ((this_buffer_size = fread(buffer, 1, BUFFERSIZE, infile)) > 0) {
  int i = 1;
  while (i < buffer) {
    // buffer from i to i + 81 would look like this:
    // 004300209005009001070060043006002087190007400050083000600000105003508690042910300
    // whereas for b it looks from i to i + 81 like this:
    // 864371259325849761971265843436192587198657432257483916689734125713528694542916378
    a[line] = std::string(buffer + i, buffer + i + 81);
    i += 81 + 1; // skip to the next value, +1 for the , or a newline
    b[line] = std::string(buffer + i, buffer + i + 81);
    i += 81 + 1; // skip to the next value, +1 for the , or a newline
    line++;
  }
  // check next buffer, not shown here...
}

// NEXT: parse the data to an R structure

This takes for the 1 million row dataset below 250ms.

Then I want to create a Rcpp::DataFrame from the two vectors a and b, this is where the question comes in. The conversion to the R object takes around 80 seconds.

Is there any faster alternative, taking the data knowledge into account (2 items per row, each 81 chars long, 1 million rows, ...)?

I am not bound on filling the std::vectors first, if possible I can also directly collect the data in an Rcpp structure.

What I have tried so far

Textbook solution

Rcpp::DataFrame df = Rcpp::DataFrame::create(
  Rcpp::Named("unsolved") = a,
  Rcpp::Named("solved") = b,
  Rcpp::Named("stringsAsFactors") = false
);

List first


Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

df["unsolved"] = a;
df["solved"] = b;

df.attr("class") = Rcpp::CharacterVector::create("data.frame");

Character Matrix

Not truly comparable to the other approaches, but felt more native...

// before the main loop
std::vector<std::string> vec;
// vec holds both data entries, the first (unsolved) at 0 -> n_lines and solved values at n_lines -> n_lines * 2
vec.resize(2 * n_lines);


// inside the loop
vec[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
vec[l + n_lines] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;


// to CharacterMatrix
Rcpp::CharacterMatrix res(n_lines, 2, vec.begin());

Full Code and Timings on Github

like image 598
David Avatar asked Nov 27 '20 19:11

David


2 Answers

Thanks for making a snapshot of the data available (BTW: no point tar'ing a single file, you could just have xz'ed the csvfile. Anyway.)

I get different results on my Ubuntu 20.04 box which are closer to what I anticipated:

  • data.table::fread() is competitive as we expected (I am running data.table from git as there was a regression in the most recent release)
  • vroom and stringfish, once we force materialization to compare apples to apples rather than images of apples, are about the same
  • Rcpp is in the ballpark too but a little more variable

I capped it at 10 runs, the variability probably comes down if you run more but the caching influences it too.

In short: no clear winners, and surely no mandate to replace one of the (alreadty known to be tuned) reference implementations.

edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
  expr     min      lq    mean  median      uq     max neval cld
 fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939    10   a
 vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921    10   a
 sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975    10   a
 rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847    10   a
 rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851    10   a
edd@rob:~/git/stackoverflow/65043010(master)$ 

Code for top-level script

suppressMessages({
    library(data.table)
    library(Rcpp)
    library(vroom)
    library(stringfish)
    library(microbenchmark)
})

vroomread <- function(csvfile) {
    a <- vroom(csvfile, col_types = "cc", progress = FALSE)
    vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
    a <- sf_readLines(csvfile)
    dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
                                 sol = sf_substr(a, 83, 163))
}

sourceCpp("rcppfuncs.cpp")


csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
               vroom=vroomread(csvfile),
               sfish=sfread(csvfile),
               rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
               rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
               times=10)

Code for Rcpp script

#include <Rcpp.h>
#include <fstream>

//[[Rcpp::export]]

Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    a.push_back(line.substr(0, 80));
    b.push_back(line.substr(82, 162));
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    a(l) = line.substr(0, 80);
    b(l) = line.substr(82, 162);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}
like image 61
Dirk Eddelbuettel Avatar answered Nov 14 '22 00:11

Dirk Eddelbuettel


This is not really a proper answer to my question, more some thoughts I didn't want to go wasted as well as some benchmarks. Maybe useful to someone who faces a similar issue.

To recall, the basic idea is to read 1 million rows of two 81 character long strings into an R object (preferably a data.frame, data.table, or tibble). For the benchmarks I have used the 1 million sudoku dataset of Kyubyong Park.

I structured the answer into two parts: 1) using other R packages and 2) using Rcpp/C++ and C to work on a lower level.

Surprisingly, for character data specialised packages such as stringi, stringfish, or vroom are really efficient and beat (my) lower level C++/C code.

Important to note is that some packages use ALTREP (see for example Francoise take on them here), which means that the data does not materialize in R until needed. I.e., loading the data using vroom takes less than 1 second, but the first operations (which need to materialize the data) take way longer... To circumnavigate this, I either force the materialization of the data by putting it into a data.table or use an internal function of vroom to force it.

1) R packages

data.table and fread - 75 secs

Mainly as a base benchmark.

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
dt <- data.table::fread(file, colClasses = "character")
tictoc::toc()
#> 75.296 sec elapsed

Vroom with materialization - 19 secs

Note that vroom uses ALTREP, forcing materialization to level the playing field!

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
# internal function that materializes the ALTREP data
df <- vroom:::vroom_materialize(a, TRUE)
tictoc::toc()
#> 19.926 sec elapsed

Stringfish - 19 secs

Stringfish uses ALTREP, so reading the data and taking the substrings takes less than one second. Materialization takes the rest, similar to vroom.

library(stringfish)
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- sf_readLines(file)

dt <- data.table::data.table(
  uns = sf_substr(a, 1, 81),
  sol = sf_substr(a, 83, 163)
)
tictoc::toc()
#> 19.698 sec elapsed

Stringi - 22 secs

Note that the conversion to data.table takes virtually no time.

tictoc::tic()
a <- stringi::stri_read_lines(file)
# discard header
a <- a[-1]

dt <- data.table::data.table(
  uns = stringi::stri_sub(a, 1, 81),
  sol = stringi::stri_sub(a, 83, 163)
)
tictoc::toc() 
#> 22.409 sec elapsed

2) C and Cpp functions

Rcpp with ifstream read to STL first - 22 secs

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    a.push_back(line.substr(0, 80));
    b.push_back(line.substr(82, 162));
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 22.098 sec elapsed
*/

Rcpp with ifstream read directly to Rcpp::CharacterVector - 21 secs

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    a(l) = line.substr(0, 80);
    b(l) = line.substr(82, 162);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream_charvector(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 21.436 sec elapsed
*/

Rcpp with buffer - 75 secs

This is basically the initial approach I chose, as outlined in the question above. Not really sure why its slower than the others...

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_buffer(std::string filename) {
  const int max_buffer_size = 1e8;
  const int header_size = 18; // only fixed in this example...
  const int n_lines = 1000000;

  FILE* infile;
  infile = fopen(filename.c_str(), "r");
  if (infile == NULL) Rcpp::stop("File Error!\n");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  // initiate the buffers
  char* buffer;
  int64_t buffer_size = sizeof(char) * max_buffer_size > file_size
    ? file_size : max_buffer_size;
  buffer = (char*) malloc(buffer_size);

  // skip the header...
  int64_t this_buffer_size = fread(buffer, 1, header_size, infile);

  // a holds the first part (quizzes or unsolved) b holds solution/solved
  std::vector<std::string> a, b;
  a.resize(n_lines);
  b.resize(n_lines);

  const int line_length = 2 * 82; // 2 times 81 digits plus one , or newline
  int l = 0;
  // fill the buffer
  int current_pos = ftell(infile);
  int next_buffer_size = file_size - current_pos > buffer_size
    ? buffer_size : file_size - current_pos;

  while ((this_buffer_size = fread(buffer, 1, next_buffer_size, infile)) > 0) {
    // read a buffer from current_pos to ftell(infile)
    Rcpp::checkUserInterrupt();
    int i = 0;
    while (i + line_length <= this_buffer_size) {
      a[l] = std::string(buffer + i, buffer + i + 81);
      i += 82;
      b[l] = std::string(buffer + i, buffer + i + 81);;
      i += 82;
      l++;
    }

    if (i == 0) break;
    if (i != this_buffer_size) {
      // file pointer reset by i - this_buffer_size (offset to end of buffer)
      fseek(infile, i - this_buffer_size, SEEK_CUR);
    }
    // determine the next buffer size. If the buffer is too large, take only whats
    // needed
    current_pos = ftell(infile);
    next_buffer_size = file_size - current_pos > buffer_size
      ? buffer_size : file_size - current_pos;
  }

  free(buffer);
  fclose(infile);

  Rcpp::DataFrame df = Rcpp::DataFrame::create(
    Rcpp::Named("unsolved") = a,
    Rcpp::Named("solved") = b,
    Rcpp::Named("stringsAsFactors") = false
  );
  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_buffer(file)
tictoc::toc()
75.915 sec elapsed
*/

Using Rs C API - 125 secs

Not sure why this is not faster, probably because my C code is not efficient... If you have any improvements, I'll gladly update the timings.

The mkChar() function creates a CHARSXP which can be inserted into a character vector STRSXP. Note that most R characters are stored in a cache (see also 1.10 of R Internals), maybe if we can circumvent the cache we can gain some speedups - not sure how to do this or if this is wise in any way...

Preferably, I would like to pre allocate 1 mln STRSXP of size 81, memcpy() the values from the C array, and SET_STRING_ELT() them to the vector. No idea how to do it, though.

See also:

  • https://cran.r-project.org/doc/manuals/r-release/R-ints.html
  • http://adv-r.had.co.nz/C-interface.html
  • https://github.com/hadley/r-internals/
read_to_list_sexp <- inline::cfunction(c(fname = "character"), '
  const char * filename = CHAR(asChar(fname));

  FILE* infile;
  infile = fopen(filename, "r");
  if (infile == NULL) error("File cannot be opened");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  const int n_lines = 1000000;

  SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
  SEXP sol = PROTECT(allocVector(STRSXP, n_lines));

  char * line = NULL;
  size_t len = 0;
  ssize_t read;

  int l = 0;

  char char_array[82];
  char_array[81] = 0;
  // skip header
  read = getline(&line, &len, infile);

  while ((read = getline(&line, &len, infile)) != -1) {
    memcpy(char_array, line, 81);
    SET_STRING_ELT(uns, l, mkChar(char_array));

    memcpy(char_array, line + 82, 81);
    SET_STRING_ELT(sol, l, mkChar(char_array));

    l++;
    if (l == n_lines) break;
  }
  fclose(infile);

  SEXP res = PROTECT(allocVector(VECSXP, 2));

  SET_VECTOR_ELT(res, 0, uns);
  SET_VECTOR_ELT(res, 1, sol);

  UNPROTECT(3);
  return res;
')

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- foo(file)
df <- data.table::as.data.table(a)
tictoc::toc()
#> 125.514 sec elapsed
like image 29
David Avatar answered Nov 14 '22 00:11

David