I have written a parser that reads character values into std::vector<std::string> vectors; it parses 1 million records in under a second. Then I want to convert the vectors to an Rcpp::DataFrame, which takes more than 80 seconds...
Is there a way to create an Rcpp::DataFrame from large character vectors "efficiently"?
When using numeric values, I would try to std::memcpy() the std::vector to an Rcpp::NumericVector (see this int64 example or this data.table example for more information), but this does not seem to work with character vectors due to their varying sizes.
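For context, here is a minimal sketch of that memcpy() idea for the numeric case (my own illustration, not taken from the linked examples): it works because the payload of a REALSXP is one contiguous double array, whereas a character vector is an array of CHARSXP pointers whose elements must be created one by one.
#include <Rcpp.h>
#include <cstring>
#include <vector>

// Bulk-copy a std::vector<double> into an R numeric vector: a REALSXP
// stores its doubles contiguously, so a single memcpy() moves the
// whole payload. No equivalent exists for STRSXP character vectors.
Rcpp::NumericVector stl_to_numeric(const std::vector<double>& src) {
  Rcpp::NumericVector out(src.size());
  std::memcpy(out.begin(), src.data(), src.size() * sizeof(double));
  return out;
}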
The basic idea of the function is to parse Sudoku string data. Each Sudoku string is exactly 81 characters long, and there are two Sudokus in each line (the data is saved as a .csv file; you can find the data here).
$ head sudoku.csv
quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425
100005007380900000600000480820001075040760020069002001005039004000020100000046352,194685237382974516657213489823491675541768923769352841215839764436527198978146352
009065430007000800600108020003090002501403960804000100030509007056080000070240090,289765431317924856645138729763891542521473968894652173432519687956387214178246395
000000657702400100350006000500020009210300500047109008008760090900502030030018206,894231657762495183351876942583624719219387564647159328128763495976542831435918276
503070190000006750047190600400038000950200300000010072000804001300001860086720005,563472198219386754847195623472638519951247386638519472795864231324951867186723945
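For later reference, this is the fixed layout those lines imply (my own summary, assuming every data row has the form <81 chars>,<81 chars> followed by a newline):
const int SUDOKU_LEN = 81;                   // one puzzle or solution string
const int RECORD_LEN = 2 * (SUDOKU_LEN + 1); // 81 + ',' + 81 + '\n' = 164 bytes per row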
Inside of the cpp read function, I fread() the file, fill a buffer (buffer), and parse the data into said std::vector<std::string> vectors (a and b in this example).
Note that the full code including the experiments I have done so far can be found in this gist.
const int BUFFERSIZE = 1e8;
const int n_lines = count_lines(filename); // 1 million in this case
FILE* infile;
infile = fopen(filename.c_str(), "r");
unsigned char * buffer;
buffer = (unsigned char*) malloc(BUFFERSIZE);
int64_t this_buffer_size;
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);
// removing of header not shown here...
// BUFFERSIZE is also checked so that no overflow occurs... not shown here..
int line = 0;
while ((this_buffer_size = fread(buffer, 1, BUFFERSIZE, infile)) > 0) {
int i = 1;
while (i < this_buffer_size) {
// buffer from i to i + 81 would look like this:
// 004300209005009001070060043006002087190007400050083000600000105003508690042910300
// whereas for b it looks from i to i + 81 like this:
// 864371259325849761971265843436192587198657432257483916689734125713528694542916378
a[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
b[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
line++;
}
// check next buffer, not shown here...
}
// NEXT: parse the data to an R structure
For the 1 million row dataset, this takes less than 250 ms.
Then I want to create an Rcpp::DataFrame from the two vectors a and b, and this is where the question comes in: the conversion to the R object takes around 80 seconds.
Is there any faster alternative, taking the data knowledge into account (2 items per row, each 81 chars long, 1 million rows, ...)?
I am not bound to filling the std::vectors first; if possible, I can also directly collect the data in an Rcpp structure.
The straightforward conversion via Rcpp::DataFrame::create():
Rcpp::DataFrame df = Rcpp::DataFrame::create(
Rcpp::Named("unsolved") = a,
Rcpp::Named("solved") = b,
Rcpp::Named("stringsAsFactors") = false
);
A second attempt, assembling the same structure manually as an Rcpp::List:
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.frame");
A third idea, collecting both columns into a single std::vector and wrapping it in an Rcpp::CharacterMatrix - not truly comparable to the other approaches, but it felt more native...
// before the main loop
std::vector<std::string> vec;
// vec holds both data entries, the first (unsolved) at 0 -> n_lines and solved values at n_lines -> n_lines * 2
vec.resize(2 * n_lines);
int l = 0; // row counter used in the loop below
// inside the loop
vec[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
vec[l + n_lines] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;
// to CharacterMatrix
Rcpp::CharacterMatrix res(n_lines, 2, vec.begin());
Thanks for making a snapshot of the data available (BTW: no point tar'ing a single file, you could just have xz'ed the csvfile. Anyway.)
I get different results on my Ubuntu 20.04 box which are closer to what I anticipated:
- data.table::fread() is competitive, as we expected (I am running data.table from git as there was a regression in the most recent release)
- vroom and stringfish, once we force materialization to compare apples to apples rather than images of apples, are about the same
- Rcpp is in the ballpark too, but a little more variable
I capped it at 10 runs; the variability probably comes down if you run more, but the caching influences it too.
In short: no clear winners, and surely no mandate to replace one of the (already known to be tuned) reference implementations.
edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
expr min lq mean median uq max neval cld
fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939 10 a
vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921 10 a
sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975 10 a
rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847 10 a
rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851 10 a
edd@rob:~/git/stackoverflow/65043010(master)$
suppressMessages({
library(data.table)
library(Rcpp)
library(vroom)
library(stringfish)
library(microbenchmark)
})
vroomread <- function(csvfile) {
a <- vroom(csvfile, col_types = "cc", progress = FALSE)
vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
a <- sf_readLines(csvfile)
dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163))
}
sourceCpp("rcppfuncs.cpp")
csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
vroom=vroomread(csvfile),
sfish=sfread(csvfile),
rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
times=10)
#include <Rcpp.h>
#include <fstream>
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
a.push_back(line.substr(0, 81));
b.push_back(line.substr(82, 81));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
a(l) = line.substr(0, 81);
b(l) = line.substr(82, 81);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
This is not really a proper answer to my question, more some thoughts I didn't want to go wasted as well as some benchmarks. Maybe useful to someone who faces a similar issue.
To recall, the basic idea is to read 1 million rows of two 81 character long strings into an R object (preferably a data.frame, data.table, or tibble). For the benchmarks I have used the 1 million sudoku dataset of Kyubyong Park.
I structured the answer into two parts: 1) using other R packages and 2) using Rcpp/C++ and C to work on a lower level.
Surprisingly, for character data, specialised packages such as stringi, stringfish, or vroom are really efficient and beat (my) lower-level C++/C code.
Important to note is that some packages use ALTREP (see for example François' take on them here), which means that the data does not materialize in R until needed. I.e., loading the data using vroom takes less than 1 second, but the first operations (which need to materialize the data) take way longer... To work around this, I either force the materialization of the data by putting it into a data.table or use an internal function of vroom to force it.
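To check whether a given vector is still a deferred ALTREP object rather than a materialized one, a one-liner against R's C API works. This is my own sketch; I am assuming the ALTREP() macro from Rinternals.h (available since R 3.5.0), which Rcpp.h pulls in:
#include <Rcpp.h>

// TRUE while x is an ALTREP object, i.e. its contents may not have
// been materialized into an ordinary in-memory vector yet.
// [[Rcpp::export]]
bool is_altrep(SEXP x) {
  return ALTREP(x);
}
Calling this on a vroom column before and after vroom_materialize() should show exactly when the deferred representation is dropped.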
fread - 75 secs
Mainly as a base benchmark.
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
dt <- data.table::fread(file, colClasses = "character")
tictoc::toc()
#> 75.296 sec elapsed
vroom - 20 secs
Note that vroom uses ALTREP; I force materialization here to level the playing field!
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
# internal function that materializes the ALTREP data
df <- vroom:::vroom_materialize(a, TRUE)
tictoc::toc()
#> 19.926 sec elapsed
stringfish - 20 secs
Stringfish uses ALTREP, so reading the data and taking the substrings takes less than one second. Materialization takes the rest, similar to vroom.
library(stringfish)
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- sf_readLines(file)
dt <- data.table::data.table(
uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163)
)
tictoc::toc()
#> 19.698 sec elapsed
stringi - 22 secs
Note that the conversion to data.table takes virtually no time.
tictoc::tic()
a <- stringi::stri_read_lines(file)
# discard header
a <- a[-1]
dt <- data.table::data.table(
uns = stringi::stri_sub(a, 1, 81),
sol = stringi::stri_sub(a, 83, 163)
)
tictoc::toc()
#> 22.409 sec elapsed
ifstream read to STL first - 22 secs
#include <Rcpp.h>
#include <fstream>
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
a.push_back(line.substr(0, 81));
b.push_back(line.substr(82, 81));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 22.098 sec elapsed
*/
ifstream read directly to Rcpp::CharacterVector - 21 secs
#include <Rcpp.h>
#include <fstream>
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
a(l) = line.substr(0, 81);
b(l) = line.substr(82, 81);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream_charvector(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 21.436 sec elapsed
*/
fread to buffer - 76 secs
This is basically the initial approach I chose, as outlined in the question above. Not really sure why it's slower than the others...
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_buffer(std::string filename) {
const int max_buffer_size = 1e8;
const int header_size = 18; // only fixed in this example...
const int n_lines = 1000000;
FILE* infile;
infile = fopen(filename.c_str(), "r");
if (infile == NULL) Rcpp::stop("File Error!\n");
fseek(infile, 0L, SEEK_END);
int64_t file_size = ftell(infile);
fseek(infile, 0L, SEEK_SET);
// initiate the buffers
char* buffer;
int64_t buffer_size = sizeof(char) * max_buffer_size > file_size
? file_size : max_buffer_size;
buffer = (char*) malloc(buffer_size);
// skip the header...
int64_t this_buffer_size = fread(buffer, 1, header_size, infile);
// a holds the first part (quizzes or unsolved) b holds solution/solved
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);
const int line_length = 2 * 82; // 2 times (81 digits plus one ',' or newline)
int l = 0;
// fill the buffer
int current_pos = ftell(infile);
int next_buffer_size = file_size - current_pos > buffer_size
? buffer_size : file_size - current_pos;
while ((this_buffer_size = fread(buffer, 1, next_buffer_size, infile)) > 0) {
// read a buffer from current_pos to ftell(infile)
Rcpp::checkUserInterrupt();
int i = 0;
while (i + line_length <= this_buffer_size) {
a[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
b[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;
}
if (i == 0) break;
if (i != this_buffer_size) {
// file pointer reset by i - this_buffer_size (offset to end of buffer)
fseek(infile, i - this_buffer_size, SEEK_CUR);
}
// determine the next buffer size. If the buffer is too large, take only whats
// needed
current_pos = ftell(infile);
next_buffer_size = file_size - current_pos > buffer_size
? buffer_size : file_size - current_pos;
}
free(buffer);
fclose(infile);
Rcpp::DataFrame df = Rcpp::DataFrame::create(
Rcpp::Named("unsolved") = a,
Rcpp::Named("solved") = b,
Rcpp::Named("stringsAsFactors") = false
);
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_buffer(file)
tictoc::toc()
75.915 sec elapsed
*/
Using C and mkChar() - 125 secs
Not sure why this is not faster, probably because my C code is not efficient... If you have any improvements, I'll gladly update the timings.
The mkChar() function creates a CHARSXP, which can be inserted into a character vector (STRSXP). Note that most R character values are stored in a cache (see also 1.10 of R Internals); maybe if we can circumvent the cache we can gain some speedups - not sure how to do this or if this is wise in any way...
Preferably, I would like to pre-allocate 1 million STRSXPs of size 81, memcpy() the values from the C array, and SET_STRING_ELT() them into the vector. No idea how to do that, though.
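The closest thing I can see to that idea is mkCharLen(): it still goes through the CHARSXP cache, but it takes a pointer plus an explicit length, so the strlen() inside mkChar() and the NUL-terminated char_array staging copy can both be dropped. A sketch of that variant (my own, untested and not benchmarked):
#include <Rcpp.h>

// Build a STRSXP from n fixed-width fields in a raw buffer, each
// `width` bytes long and `stride` bytes apart. Rf_mkCharLen() takes a
// pointer plus length, so no NUL terminator and no strlen() is needed;
// the global CHARSXP cache is still used for every element.
SEXP fixed_width_strsxp(const char* buf, R_xlen_t n, int width, int stride) {
  SEXP out = PROTECT(Rf_allocVector(STRSXP, n));
  for (R_xlen_t i = 0; i < n; ++i)
    SET_STRING_ELT(out, i, Rf_mkCharLen(buf + i * stride, width));
  UNPROTECT(1);
  return out;
}
The actual attempt below sticks with getline() and mkChar() via the inline package: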
read_to_list_sexp <- inline::cfunction(c(fname = "character"), '
const char * filename = CHAR(asChar(fname));
FILE* infile;
infile = fopen(filename, "r");
if (infile == NULL) error("File cannot be opened");
fseek(infile, 0L, SEEK_END);
int64_t file_size = ftell(infile);
fseek(infile, 0L, SEEK_SET);
const int n_lines = 1000000;
SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
SEXP sol = PROTECT(allocVector(STRSXP, n_lines));
char * line = NULL;
size_t len = 0;
ssize_t read;
int l = 0;
char char_array[82];
char_array[81] = 0;
// skip header
read = getline(&line, &len, infile);
while ((read = getline(&line, &len, infile)) != -1) {
memcpy(char_array, line, 81);
SET_STRING_ELT(uns, l, mkChar(char_array));
memcpy(char_array, line + 82, 81);
SET_STRING_ELT(sol, l, mkChar(char_array));
l++;
if (l == n_lines) break;
}
fclose(infile);
SEXP res = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(res, 0, uns);
SET_VECTOR_ELT(res, 1, sol);
UNPROTECT(3);
return res;
')
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- read_to_list_sexp(file)
df <- data.table::as.data.table(a)
tictoc::toc()
#> 125.514 sec elapsed