I was inspired by the fst package to try to write a C++ function to quickly serialize some data structures I have in R to disk.
But I am having trouble achieving the same write speed even on very simple objects. The code below is a simple example of writing a large 1 GB vector to disk.
Using custom C++ code, I achieve a write speed of 135 MB/s, which is the limit of my disk according to CrystalBench.
On the same data, write_fst achieves a write speed of 223 MB/s, which seems impossible since my disk can't write that fast. (Note: I am using fst::threads_fst(1) and compress = 0, and the resulting files are the same size.)
What am I missing?
How can I get the C++ function to write to disk faster?
C++ Code:
#include <Rcpp.h>
#include <fstream>
#include <cstring>
#include <iostream>
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;
// [[Rcpp::export]]
void test(SEXP x) {
  // View the numeric vector's payload as raw bytes
  char* d = reinterpret_cast<char*>(REAL(x));
  // long long avoids overflow on platforms where long is 32-bit
  long long dl = Rf_xlength(x) * 8;
  std::ofstream OutFile;
  OutFile.open("/tmp/test.raw", std::ios::out | std::ios::binary);
  OutFile.write(d, dl);
  OutFile.close();
}
R Code:
library(microbenchmark)
library(Rcpp)
library(dplyr)
library(fst)
fst::threads_fst(1)
sourceCpp("test.cpp")
x <- runif(134217728) # 1 gigabyte
df <- data.frame(x)
microbenchmark(test(x), write_fst(df, "/tmp/test.fst", compress=0), times=3)
Unit: seconds
                                         expr      min       lq     mean   median       uq      max neval
                                      test(x) 6.549581 7.262408 7.559021 7.975235 8.063740 8.152246     3
 write_fst(df, "/tmp/test.fst", compress = 0) 4.548579 4.570346 4.592398 4.592114 4.614307 4.636501     3
file.info("/tmp/test.fst")$size/1e6
# [1] 1073.742
file.info("/tmp/test.raw")$size/1e6
# [1] 1073.742
Benchmarking SSD write and read performance is tricky and hard to do right; there are many effects to take into account.
For example, many SSDs use techniques such as intelligent DRAM caching to accelerate data speeds. These techniques can inflate the apparent write speed, especially when an identical dataset is written to disk multiple times, as in your example. To avoid this effect, each iteration of the benchmark should write a unique dataset to disk.
The block sizes of write and read operations are also important: the default physical sector size of SSDs is 4 KB. Writing smaller blocks hampers performance, but with fst I found that writing blocks larger than a few MB also lowers performance, due to CPU cache effects. Because fst writes its data to disk in relatively small chunks, it is usually faster than alternatives that write the data in a single large block.
To facilitate this block-wise writing to SSD, you could modify your code:
Rcpp::cppFunction('
  #include <fstream>
  #include <cstring>
  #include <iostream>

  #define BLOCKSIZE 262144  // 2^18 bytes per block

  long long test_blocks(SEXP x, Rcpp::String path) {
    char* d = reinterpret_cast<char*>(REAL(x));
    std::ofstream outfile;
    outfile.open(path.get_cstring(), std::ios::out | std::ios::binary);

    // long long avoids overflow on platforms where long is 32-bit
    long long dl = Rf_xlength(x) * 8;
    long long nr_of_blocks = dl / BLOCKSIZE;

    // write all full blocks
    for (long long block_nr = 0; block_nr < nr_of_blocks; block_nr++) {
      outfile.write(&d[block_nr * BLOCKSIZE], BLOCKSIZE);
    }

    // write the (possibly empty) remainder
    long long remaining_bytes = dl % BLOCKSIZE;
    outfile.write(&d[nr_of_blocks * BLOCKSIZE], remaining_bytes);

    outfile.close();
    return dl;
  }
')
Now we can compare the methods test, test_blocks and fst::write_fst in a single benchmark (here test has been modified to accept a file path argument, like test_blocks):
x <- runif(134217728) # 1 gigabyte
df <- data.frame(X = x)
fst::threads_fst(1) # use fst in single threaded mode
microbenchmark::microbenchmark(
test(x, "test.bin"),
test_blocks(x, "test.bin"),
fst::write_fst(df, "test.fst", compress = 0),
times = 10)
#> Unit: seconds
#>                                          expr      min       lq     mean   median       uq      max neval
#>                           test(x, "test.bin") 1.473615 1.506019 1.590430 1.600055 1.635883 1.765512    10
#>                    test_blocks(x, "test.bin") 1.018082 1.062673 1.134956 1.131631 1.204373 1.264220    10
#>  fst::write_fst(df, "test.fst", compress = 0) 1.127446 1.144039 1.249864 1.261269 1.327304 1.343248    10
As you can see, the modified method test_blocks is about 40 percent faster than the original method and even slightly faster than the fst package. This is expected, because fst has some overhead in storing column and table information, (possible) attributes, hashes and compression information.
Please note that the difference between fst and your initial test method is much less pronounced on my system, showing again the challenges in using benchmarks to optimize a system.