I am parsing a ~500 GB log file, and my C++ version takes 3.5 minutes while my Go version takes 1.2 minutes.
I am using C++ streams to read the file line by line:
#include <fstream>
#include <string>
#include <iostream>

int main( int argc, char** argv ) {
    int linecount = 0;
    std::string line;
    std::ifstream infile( argv[ 1 ] );
    if ( infile ) {
        while ( std::getline( infile, line ) ) {
            linecount++;
        }
        std::cout << linecount << ": " << line << '\n';
    }
    infile.close();
    return 0;
}
Firstly, why is it so slow to use this code? Secondly, how can I improve it to make it faster?
The C++ standard library's iostreams are notoriously slow, and this is true of all the major implementations of the standard library. Why? Because the standard imposes many requirements on the implementation that inhibit peak performance. This part of the standard library was designed roughly 20 years ago and is not competitive in high-performance benchmarks.
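If you want to keep iostreams anyway, one cheap mitigation is to give the stream a much larger buffer before opening the file. A hedged sketch (the 1 MiB size and the helper name `count_lines_iostream` are my own choices, and the effect of `pubsetbuf` on file streams is implementation-defined):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Count lines with iostreams, but with a larger stream buffer.
// Note: pubsetbuf() must be called before open() to take effect portably.
long count_lines_iostream( const char* path ) {
    std::vector<char> buf( 1 << 20 ); // 1 MiB buffer; tune for your workload
    std::ifstream in;
    in.rdbuf()->pubsetbuf( buf.data(),
                           static_cast<std::streamsize>( buf.size() ) );
    in.open( path, std::ios::binary );
    long lines = 0;
    std::string line;
    while ( std::getline( in, line ) )
        ++lines;
    return lines; // buf outlives 'in', so the buffer stays valid
}
```

Whether this helps depends on your standard library; some implementations ignore the user buffer entirely, so benchmark before relying on it.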
How can you avoid it? Use other libraries for high-performance asynchronous I/O, such as Boost.Asio, or the native functions provided by your OS.
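As one example of the OS-level route, on POSIX systems you can map the file into memory and scan it directly, avoiding the read-into-buffer copy. A sketch, not a drop-in solution (error handling is minimal, and `count_newlines_mmap` is a name I made up):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>

// Count '\n' bytes by memory-mapping the whole file (POSIX only).
long count_newlines_mmap( const char* path ) {
    int fd = ::open( path, O_RDONLY );
    if ( fd < 0 ) return -1;
    struct stat st;
    if ( ::fstat( fd, &st ) != 0 ) { ::close( fd ); return -1; }
    if ( st.st_size == 0 )         { ::close( fd ); return 0; }
    void* p = ::mmap( nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    ::close( fd ); // the mapping stays valid after close
    if ( p == MAP_FAILED ) return -1;
    const char* data = static_cast<const char*>( p );
    long n = std::count( data, data + st.st_size, '\n' );
    ::munmap( p, st.st_size );
    return n;
}
```

For a 500 GB file you would map and scan the file in windows rather than all at once on 32-bit systems, but on 64-bit Linux a single mapping of this size is fine.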
If you want to stay within the standard, the function std::basic_istream::read() may satisfy your performance demands. But then you have to do the buffering and line counting yourself. Here's how it can be done:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <vector>

int main( int, char** argv ) {
    long linecount = 1; // start at 1: the last line may not end in '\n'
    std::vector<char> buffer( 1024 * 1024 ); // 1 MiB read buffer
    // open in binary mode so no newline translation distorts the count
    std::ifstream infile( argv[ 1 ], std::ios::binary );
    while ( infile )
    {
        infile.read( buffer.data(), buffer.size() );
        linecount += std::count( buffer.begin(),
                                 buffer.begin() + infile.gcount(), '\n' );
    }
    std::cout << "linecount: " << linecount << '\n';
    return 0;
}
Let me know if it's faster!
Building on @Ralph Tandetzky's answer, but going down to the low-level C I/O functions, and assuming a Linux platform with a filesystem that provides good direct I/O support (but staying single-threaded):
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <iostream>

#define BUFSIZE ( 1024UL * 1024UL )

int main( int argc, char **argv )
{
    // use direct IO - the page cache only slows this down
    int fd = ::open( argv[ 1 ], O_RDONLY | O_DIRECT );

    // Direct IO needs page-aligned memory
    char *buffer = ( char * ) ::valloc( BUFSIZE );

    size_t newlines = 0UL;

    // avoid any conditional checks in the loop - have to
    // check the return value from read() anyway, so use that
    // to break the loop explicitly
    for ( ;; )
    {
        ssize_t bytes_read = ::read( fd, buffer, BUFSIZE );
        if ( bytes_read <= ( ssize_t ) 0L )
        {
            break;
        }

        // I'm guessing here that computing a boolean-style
        // result and adding it without an if statement
        // is faster - might be wrong. Try benchmarking
        // both ways to be sure.
        for ( ssize_t ii = 0; ii < bytes_read; ii++ )
        {
            newlines += ( buffer[ ii ] == '\n' );
        }
    }

    ::free( buffer );
    ::close( fd );
    std::cout << "newlines: " << newlines << std::endl;
    return( 0 );
}
If you really need to go even faster, use multiple threads so that one reads data while another counts newlines. But unless you're running on really fast hardware designed for high performance, this is overkill.
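A minimal sketch of the threaded idea using standard C++ (the thread count and even chunking are arbitrary choices of mine; a real pipeline would overlap reading with counting rather than split an already-filled buffer):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Count '\n' in a buffer with several threads, one contiguous slice each.
long count_newlines_parallel( const char* data, std::size_t size,
                              unsigned nthreads = 4 ) {
    if ( nthreads == 0 ) nthreads = 1;
    std::vector<long> partial( nthreads, 0 );
    std::vector<std::thread> workers;
    std::size_t chunk = size / nthreads;
    for ( unsigned t = 0; t < nthreads; ++t ) {
        std::size_t begin = t * chunk;
        // last thread takes the remainder
        std::size_t end = ( t + 1 == nthreads ) ? size : begin + chunk;
        workers.emplace_back( [&, t, begin, end] {
            partial[ t ] = std::count( data + begin, data + end, '\n' );
        } );
    }
    for ( auto& w : workers ) w.join();
    long total = 0;
    for ( long c : partial ) total += c;
    return total;
}
```

Each thread writes only its own slot of `partial`, so no locking is needed; the sum happens after the join.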