I have a file with millions of lines, each line has 3 floats separated by spaces. It takes a lot of time to read the file, so I tried to read them using memory mapped files only to find out that the problem is not with the speed of IO but with the speed of the parsing.
My current parsing is to take the stream (called file) and do the following
float x,y,z; file >> x >> y >> z;
Someone in Stack Overflow recommended to use Boost.Spirit but I couldn't find any simple tutorial to explain how to use it.
I'm trying to find a simple and efficient way to parse a line that looks like this:
"134.32 3545.87 3425"
I will really appreciate some help. I wanted to use strtok to split it, but I don't know how to convert strings to floats, and I'm not quite sure it's the best way.
I don't mind if the solution will be Boost or not. I don't mind if it won't be the most efficient solution ever, but I'm sure that it is possible to double the speed.
Thanks in advance.
UPDATE
Since Spirit X3 is available for testing, I've updated the benchmarks. Meanwhile I've used Nonius to get statistically sound benchmarks.
All charts below are available interactive online
Benchmark CMake project + testdata used is on github: https://github.com/sehe/bench_float_parsing
Spirit parsers are fastest. If you can use C++14 consider the experimental version Spirit X3:
The above is measures using memory mapped files. Using IOstreams, it will be slower accross the board,
but not as slow as scanf
using C/POSIX FILE*
function calls:
What follows is parts from the OLD answer
I implemented the Spirit version, and ran a benchmark comparing to the other suggested answers.
Here's my results, all tests run on the same body of input (515Mb of
input.txt
). See below for exact specs.
(wall clock time in seconds, average of 2+ runs)To my own surprise, Boost Spirit turns out to be fastest, and most elegant:
- handles/reports errors
- supports +/-Inf and NaN and variable whitespace
- no problems at all detecting the end of input (as opposed to the other mmap answer)
looks nice:
bool ok = phrase_parse(f,l, // source iterators (double_ > double_ > double_) % eol, // grammar blank, // skipper data); // output attribute
Note that
boost::spirit::istreambuf_iterator
was unspeakably much slower (15s+). I hope this helps!Benchmark details
All parsing done into
vector
ofstruct float3 { float x,y,z; }
.Generate input file using
od -f -A none --width=12 /dev/urandom | head -n 11000000
This results in a 515Mb file containing data like
-2627.0056 -1.967235e-12 -2.2784738e+33 -1.0664798e-27 -4.6421956e-23 -6.917859e+20 -1.1080849e+36 2.8909405e-33 1.7888695e-12 -7.1663235e+33 -1.0840628e+36 1.5343362e-12 -3.1773715e-17 -6.3655537e-22 -8.797282e+31 9.781095e+19 1.7378472e-37 63825084 -1.2139188e+09 -5.2464635e-05 -2.1235992e-38 3.0109424e+08 5.3939846e+30 -6.6146894e-20
Compile the program using:
g++ -std=c++0x -g -O3 -isystem -march=native test.cpp -o test -lboost_filesystem -lboost_iostreams
Measure wall clock time using
time ./test < input.txt
Full code to the old benchmark is in the edit history of this post, the newest version is on github
If the conversion is the bottle neck (which is quite possible), you should start by using the different possiblities in the standard. Logically, one would expect them to be very close, but practically, they aren't always:
You've already determined that std::ifstream
is too slow.
Converting your memory mapped data to an std::istringstream
is almost certainly not a good solution; you'll first have to create a string, which will copy all of the data.
Writing your own streambuf
to read directly from the memory, without copying (or using the deprecated std::istrstream
) might be a solution, although if the problem really is the conversion... this still uses the same conversion routines.
You can always try fscanf
, or scanf
on your memory mapped stream. Depending on the implementation, they might be faster than the various istream
implementations.
Probably faster than any of these is to use strtod
. No need to tokenize for this: strtod
skips leading white space (including '\n'
), and has an out parameter where it puts the address of the first character not read. The end condition is a bit tricky, your loop should probably look a bit like:
char* begin; // Set to point to the mmap'ed data... // You'll also have to arrange for a '\0' // to follow the data. This is probably // the most difficult issue. char* end; errno = 0; double tmp = strtod( begin, &end ); while ( errno == 0 && end != begin ) { // do whatever with tmp... begin = end; tmp = strtod( begin, &end ); }
If none of these are fast enough, you'll have to consider the actual data. It probably has some sort of additional constraints, which means that you can potentially write a conversion routine which is faster than the more general ones; e.g. strtod
has to handle both fixed and scientific, and it has to be 100% accurate even if there are 17 significant digits. It also has to be locale specific. All of this is added complexity, which means added code to execute. But beware: writing an efficient and correct conversion routine, even for a restricted set of input, is non-trivial; you really do have to know what you are doing.
EDIT:
Just out of curiosity, I've run some tests. In addition to the afore mentioned solutions, I wrote a simple custom converter, which only handles fixed point (no scientific), with at most five digits after the decimal, and the value before the decimal must fit in an int
:
double convert( char const* source, char const** endPtr ) { char* end; int left = strtol( source, &end, 10 ); double results = left; if ( *end == '.' ) { char* start = end + 1; int right = strtol( start, &end, 10 ); static double const fracMult[] = { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 }; results += right * fracMult[ end - start ]; } if ( endPtr != nullptr ) { *endPtr = end; } return results; }
(If you actually use this, you should definitely add some error handling. This was just knocked up quickly for experimental purposes, to read the test file I'd generated, and nothing else.)
The interface is exactly that of strtod
, to simplify coding.
I ran the benchmarks in two environments (on different machines, so the absolute values of any times aren't relevant). I got the following results:
Under Windows 7, compiled with VC 11 (/O2):
Testing Using fstream directly (5 iterations)... 6.3528e+006 microseconds per iteration Testing Using fscan directly (5 iterations)... 685800 microseconds per iteration Testing Using strtod (5 iterations)... 597000 microseconds per iteration Testing Using manual (5 iterations)... 269600 microseconds per iteration
Under Linux 2.6.18, compiled with g++ 4.4.2 (-O2, IIRC):
Testing Using fstream directly (5 iterations)... 784000 microseconds per iteration Testing Using fscanf directly (5 iterations)... 526000 microseconds per iteration Testing Using strtod (5 iterations)... 382000 microseconds per iteration Testing Using strtof (5 iterations)... 360000 microseconds per iteration Testing Using manual (5 iterations)... 186000 microseconds per iteration
In all cases, I'm reading 554000 lines, each with 3 randomly generated floating point in the range [0...10000)
.
The most striking thing is the enormous difference between fstream
and fscan
under Windows (and the relatively small difference between fscan
and strtod
). The second thing is just how much the simple custom conversion function gains, on both platforms. The necessary error handling would slow it down a little, but the difference is still significant. I expected some improvement, since it doesn't handle a lot of things the the standard conversion routines do (like scientific format, very, very small numbers, Inf and NaN, i18n, etc.), but not this much.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With