
Is there a faster way to load a file in C++ using the command line?

I want to load one million random integers from a .txt into a vector using the command line:

program.exe < million-integers.txt

My code below works, but takes several seconds to run. Is there something I can do to make it faster? I've found some solutions on SO, but they all seem to rely on hard-coding the file path. I want to be able to pass a filename through the command line.

vector<int> data;
int input;

while (cin >> input)
{
    data.push_back(input);
}
cout << "Data loaded." << endl;

(C++ noob using Visual Studio on Win 8.1)

Edit: In this case, I know that some improvement can be made since I have someone else's .exe that can do it in under a second.

Edit: All integers are on the same line.

Jim V asked Dec 09 '22 07:12



1 Answer

Run time: 4.08s. What? That's slow!

Why is this happening?

I did profiling. I'm using a very different system: OS X 10.8, with Clang, but my program is also slow, and I suspect it is for the same reason. Here are two lines from the profiling results (apologies for formatting):

Running Time         Self     Symbol Name
3389.0ms   79.3%     76.0     std::__1::num_get<char, std::__1::istreambuf_iterator<char, std::__1::char_traits<char> > >::do_get(std::__1::istreambuf_iterator<char, std::__1::char_traits<char> >, std::__1::istreambuf_iterator<char, std::__1::char_traits<char> >, std::__1::ios_base&, unsigned int&, long&) const
 824.0ms   19.2%      8.0     std::__1::basic_istream<char, std::__1::char_traits<char> >::sentry::sentry(std::__1::basic_istream<char, std::__1::char_traits<char> >&, bool)

As you can see, these two functions account for 98.5% of the execution time. Wow! When I drill down, what are these library functions calling that takes so much time?

  • flockfile()
  • funlockfile()
  • pthread_mutex_unlock()

So, on my system, the implementation for std::cin works with C's <stdio.h> functions so they can both be used in the same program, and these functions make sure to synchronize with other threads. This is inefficient.

  1. There is no code using <stdio.h>, so there is no need to synchronize with it.

  2. There is only one thread using stdin, so locking is excessive, especially if you lock once per character read. That's super excessive. Locks and system calls are pretty fast... but if you do something like 10 million locks and system calls? No longer fast.

Note: Yes, I am running OS X, and the actual functions will be different on Windows. Instead of flockfile() and pthread_mutex_unlock() you will see whatever the Windows version is.

Solution #1

Stop using redirection. If you use ifstream, then it is assumed that you will take care of locking yourself. On my system, this gives a runtime of 0.42 seconds, an improvement of close to a factor of 10.
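Here is a minimal sketch of what that could look like, assuming the filename is passed as the first command-line argument instead of being redirected. The helper name load_ints_from_file and the reserve size are my own choices for illustration, not something from the question.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated integers from the file at `path`.
std::vector<int> load_ints_from_file(const std::string& path)
{
    std::ifstream in(path);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::vector<int> data;
    data.reserve(1000000);  // assumption: roughly one million values, as in the question
    int value;
    while (in >> value)     // same extraction loop as before, but on an ifstream
        data.push_back(value);
    return data;
}

// Possible call site, with the filename taken from the command line:
//   int main(int argc, char* argv[]) {
//       if (argc < 2) return 1;
//       std::vector<int> data = load_ints_from_file(argv[1]);
//   }
```

The program would then be started as "program.exe million-integers.txt", without the "<". How much this helps on Windows is something you would have to measure; the 0.42 seconds above is from my system only.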

Solution #2

Read everything into a string, and then parse the string. This allows you to continue using redirection to read the file.

Solution #3

Disable locking on std::cin. Sorry folks, I don't know how to do that. It might be possible.

Performance limits

I suspect the ifstream version is nowhere near the performance limit of your computer. If performance were critical, I suspect you can get the warm-cache runtime close to 2 or 3 ms, when your program is only limited by memory bandwidth.

Dietrich Epp answered Feb 03 '23 11:02
