Truly asynchronous file IO in C++

I have a super fast M.2 drive. How fast is it? It doesn’t matter because I cannot utilize this speed anyway. That’s why I’m asking this question.

I have an app that needs a lot of memory, so much that it won't fit in RAM. Fortunately, it is not needed all at once; instead, it is used to save intermediate results from computations.

Unfortunately, the application cannot write and read this data fast enough. I tried using multiple reader and writer threads, but that only made it worse (later I read that it is because of this).

So my question is: is it possible to have truly asynchronous file IO in C++ to fully exploit those advertised gigabytes per second? If it is, then how (in a cross-platform way)?

You could also recommend a library that's good at tasks like this if you know one, because I believe there is no point in reinventing the wheel.

Edit:

Here is code that shows how I do file IO in my program. It isn't taken from the mentioned program, because that wouldn't be minimal, but it illustrates the problem nevertheless. Do not mind Windows.h: it is used only to set thread affinity. In the actual program I also set affinity, so that's why I included it.

#include <fstream>
#include <thread>
#include <memory>
#include <string>

#include <Windows.h> // for SetThreadAffinityMask()

void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}

void lock_thread(unsigned core_idx)
{
    SetThreadAffinityMask(GetCurrentThread(), 1LL << core_idx);
}

int main()
{
    std::ios_base::sync_with_stdio(false);
    lock_thread(0);

    auto worker_count = std::thread::hardware_concurrency() - 1;

    std::unique_ptr<std::thread[]> threads = std::make_unique<std::thread[]>(worker_count); // faster than std::vector

    for (unsigned i = 0; i < worker_count; ++i)
    {
        threads[i] = std::thread(
            [](unsigned idx) {
                lock_thread(idx);
                stress_write(1'000'000'000, idx);
            },
            i + 1
        );
    }
    stress_write(1'000'000'000, 0);

    for (unsigned i = 0; i < worker_count; ++i)
    {
        threads[i].join();
    }
}

As you can see, it's just plain old fstream. On my machine this uses 100% CPU, but only 7-9% disk (around 190 MB/s). I am wondering if it could be increased.

asked Jan 09 '20 by janekb04

1 Answer

The easiest thing to get (up to) a 10x speed-up is to change this:

void stress_write(unsigned bytes, int num)
{
  std::ofstream out("temp" + std::to_string(num));
  for (unsigned i = 0; i < bytes; ++i)
  {
    out << char(i);
  }
}

to this:

void stress_write(unsigned bytes, int num)
{
  constexpr auto chunk_size = (1u << 12u); // tune as needed
  std::ofstream out("temp" + std::to_string(num));
  for (unsigned chunk = 0; chunk < (bytes+chunk_size-1)/chunk_size; ++chunk)
  {
    char chunk_buff[chunk_size];
    auto count = (std::min)( bytes - chunk_size*chunk, chunk_size ); // needs <algorithm>; parentheses dodge Windows' min() macro
    for (unsigned j = 0; j < count; ++j)
    {
      unsigned i = j + chunk_size*chunk;
      chunk_buff[j] = char(i); // processing
    }
    out.write( chunk_buff, count );
  }
}

where we group writes into chunks of up to 4096 bytes before sending them to the std::ofstream.

The streaming operations involve a number of annoying virtual calls that are hard for compilers to elide, and these dominate performance when you are writing only a handful of bytes at a time.

By chunking data into larger pieces we make the vtable lookups rare enough that they no longer dominate.

See this SO post for more details as to why.
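
As an aside (this is my sketch, not part of the original answer): even without restructuring the loop, you can make the virtual overflow path rarer by giving the stream a bigger internal buffer via pubsetbuf. The effect of setbuf on a filebuf is implementation-defined, but mainstream standard libraries honor it when it is called before open(). The per-character call overhead remains, though, so the chunked write() above is still the bigger win.

#include <fstream>
#include <string>
#include <vector>

void stress_write_bigbuf(unsigned bytes, int num)
{
  std::vector<char> buf(1u << 20); // 1 MiB stream buffer
  std::ofstream out;
  // Must happen before open() for the filebuf to adopt the buffer.
  out.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
  out.open("temp" + std::to_string(num));
  for (unsigned i = 0; i < bytes; ++i)
    out.put(char(i)); // still one call per byte, but far fewer buffer flushes
}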


To get the last iota of performance, you may have to use something like boost.asio or access your platform's raw async file IO libraries.
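
For illustration only (the answer gives no code for this), here is a minimal sketch of one asynchronous write with the Win32 overlapped API, which fits the Windows setup in the question; the file name and buffer size are arbitrary, and error handling is reduced to early returns:

#include <Windows.h>
#include <vector>

bool overlapped_write_sketch()
{
  HANDLE file = CreateFileA("temp_async", GENERIC_WRITE, 0, nullptr,
                            CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, nullptr);
  if (file == INVALID_HANDLE_VALUE) return false;

  std::vector<char> buffer(1u << 20, 'x'); // 1 MiB payload

  OVERLAPPED ov{}; // zero-initialized: write at offset 0
  ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr); // signaled on completion

  // Starts the write and typically returns at once with ERROR_IO_PENDING
  // while the kernel performs the transfer in the background.
  if (!WriteFile(file, buffer.data(), static_cast<DWORD>(buffer.size()), nullptr, &ov)
      && GetLastError() != ERROR_IO_PENDING)
  {
    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return false;
  }

  // ... overlap useful work here, e.g. computing the next buffer ...

  DWORD written = 0;
  GetOverlappedResult(file, &ov, &written, TRUE); // TRUE = wait for completion

  CloseHandle(ov.hEvent);
  CloseHandle(file);
  return written == buffer.size();
}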

But when you are working at < 10% of the drive bandwidth while railing your CPU, aim at the low-hanging fruit first.

answered Nov 17 '22 by Yakk - Adam Nevraumont