mmap: performance when using multithreading

I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.

The problem I am facing right now is that 1 process with N worker threads performs worse than 2 processes with N/2 worker threads each. I can see this in iotop: 1 process + N threads uses only around 75% of the disk bandwidth, whereas 2 processes + N/2 threads use the full bandwidth.

Some notes:

  • This happens only if I use mmap()/munmap(). I have tried replacing them with fopen()/fread() and it worked just fine. But since the mmap()/munmap() usage comes from a 3rd-party library, I would like to keep it in its original form.
  • madvise() is called with MADV_SEQUENTIAL, but removing it or changing the advice argument doesn't seem to change anything (or only slows things down).
  • Thread affinity doesn't seem to matter. I have tried limiting each thread to a specific core. I have also tried limiting threads to core pairs (Hyper-Threading siblings). No results so far.
  • The load reported by htop is the same in both cases.

So my questions are:

  • Is there anything about mmap() I am not aware of when it is used in a multithreaded environment?
  • If so, why do 2 processes have better performance?

EDIT:

  • As pointed out in the comments, it is running on a server with 2 CPU sockets. I should probably set thread affinities so that each process always runs on the same CPU, but I think I already tried that and it didn't help.
  • Here is a piece of code with which I can reproduce the same issue as with my production software:
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef WORKERS
#define WORKERS 16
#endif

bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;

std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
    int fd = open(file_path.data(), O_RDONLY);
    if (fd != -1)
    {
        auto dir_ent = std::filesystem::directory_entry{file_path.data()};
        if (dir_ent.is_regular_file())
        {
            auto size = dir_ent.file_size();
            auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (data == MAP_FAILED)
            {
                close(fd);
                return { nullptr, 0 };
            }
            madvise(data, size, MADV_SEQUENTIAL);
            close(fd);
            return { reinterpret_cast<const std::uint8_t*>(data), size };
        }

        close(fd);
    }

    return { nullptr, 0 };
}

void unmap_file(const std::uint8_t* data, std::size_t size)
{
    munmap((void*)data, size);
}

int main(int argc, char* argv[])
{
    std::deque<std::string> queue;

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < WORKERS; ++i)
    {
        threads.emplace_back(
            [&]() {
                std::string path;

                while (true)
                {
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        while (!stop && queue.empty())
                            queue_cv.wait(lock);
                        if (stop && queue.empty())
                            return;
                        path = queue.front();
                        queue.pop_front();
                    }

                    auto [data, size] = map_file(path);
                    std::uint8_t b = 0;
                    for (auto itr = data; itr < data + size; ++itr)
                        b ^= *itr;
                    unmap_file(data, size);

                    std::cout << (int)b << std::endl;
                }
            }
        );
    }

    for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        if (p.is_regular_file())
        {
            queue.push_back(p.path().native());
            queue_cv.notify_one();
        }
    }

    stop = true;
    queue_cv.notify_all();

    for (auto& t : threads)
        t.join();

    return 0;
}

asked May 03 '19 by Marek Milkovič

2 Answers

Is there anything about mmap() I am not aware of when used in multithreaded environment?

Yes. mmap() requires significant virtual memory manipulation - effectively single-threading your process in places. Per this post from one Linus Torvalds:

... playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap:

  • quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.

  • page faulting is expensive. That's how the mapping gets populated, and it's quite slow.

Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.

So the virtual memory manipulation that mapping files requires is not only expensive, it also can't really be done in parallel - there's only one chunk of actual physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a process's virtual address space.
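A rough way to observe this is to time T threads that repeatedly map the same file versus T threads that pread() it into private buffers. The sketch below assumes this setup; run_workers is a hypothetical helper, not from the question, and absolute numbers will vary by machine:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Time `threads` workers that each repeatedly map (or pread) the same file.
// As `threads` grows, the mmap variant tends to scale worse: every
// mmap()/munmap() mutates the shared address space under the process's
// mmap lock, while pread() into a per-thread buffer does not.
static double run_workers(int fd, std::size_t size, int threads,
                          bool use_mmap, int iters)
{
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            std::vector<std::uint8_t> buf(size);
            for (int i = 0; i < iters; ++i)
            {
                if (use_mmap)
                {
                    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
                    if (p != MAP_FAILED)
                    {
                        // Touch a byte so the mapping is actually faulted in.
                        volatile std::uint8_t sink =
                            static_cast<std::uint8_t*>(p)[0];
                        (void)sink;
                        munmap(p, size);
                    }
                }
                else
                {
                    (void)!pread(fd, buf.data(), size, 0);
                }
            }
        });

    for (auto& th : pool)
        th.join();

    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
}
```

Comparing `run_workers(fd, size, T, true, N)` against `run_workers(fd, size, T, false, N)` for growing T should show the mmap variant flattening out first, consistent with the serialization described above.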

You'd almost certainly get better performance by reusing a memory buffer for each file: create each buffer once, large enough to hold any file read into it, and read files with low-level POSIX read() call(s). You might also want to experiment with page-aligned buffers and direct IO, opening with the Linux-specific O_DIRECT flag to bypass the page cache, since you apparently never re-read any data and any caching is a waste of memory and CPU cycles.

Reusing the buffer also completely eliminates any munmap() or delete/free().

You'd have to manage the buffers, though. Perhaps prepopulate a queue with N precreated buffers, and return a buffer to the queue when done with a file?
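A minimal sketch of that buffer-reuse approach, assuming each worker keeps one buffer alive across files (read_file_into is an illustrative helper name, not part of the question's code):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

#include <fcntl.h>
#include <unistd.h>

// Read a whole file into a caller-provided, reusable buffer via POSIX read().
// Returns the number of bytes read, or -1 on error. The buffer only grows
// when a file larger than any previous one is encountered, so in steady
// state there are no allocations and no mmap()/munmap() calls at all.
ssize_t read_file_into(const std::string& path, std::vector<std::uint8_t>& buf)
{
    int fd = open(path.c_str(), O_RDONLY);
    if (fd == -1)
        return -1;

    ssize_t total = 0;
    for (;;)
    {
        // Grow the buffer if the file is larger than anything seen so far.
        if (static_cast<std::size_t>(total) == buf.size())
            buf.resize(buf.empty() ? std::size_t{1} << 20 : buf.size() * 2);

        ssize_t n = read(fd, buf.data() + total, buf.size() - total);
        if (n < 0)
        {
            close(fd);
            return -1;
        }
        if (n == 0)   // EOF
            break;
        total += n;
    }

    close(fd);
    return total;
}
```

Each worker thread would declare one `std::vector<std::uint8_t>` outside its file loop and pass it to every `read_file_into` call, so teardown cost per file drops to a single `close()`.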

As far as

If so, why do 2 processes have better performance?

The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.

answered Oct 27 '22 by Andrew Henle


A few notes:

  1. Try running your application with perf stat -ddd <app> and have a look at context-switches, cpu-migrations and page-faults numbers.
  2. The threads probably contend for the kernel's per-process memory-map structures (vm_area_struct) on mmap and on page faults. Try passing the MAP_POPULATE or MAP_LOCKED flag into mmap to minimize page faults. Alternatively, try mmap with MAP_POPULATE or MAP_LOCKED in the main thread only (you may want to ensure that all threads run on the same NUMA node in this case).
  3. You may also like to experiment with MAP_HUGETLB and one of MAP_HUGE_2MB, MAP_HUGE_1GB flags.
  4. Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
  5. Set stop = true while holding the mutex; otherwise the condition variable notification can get lost and deadlock a waiting thread forever.
  6. p.is_regular_file() check doesn't require the mutex to be locked.
  7. std::deque can be replaced with std::list, using splice to push and pop elements and so minimize the time the mutex is held.
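Points 5-7 applied to the question's queue handling might look like this (a sketch; push_batch, request_stop and pop_one are illustrative names, not from the question's code):

```cpp
#include <cassert>
#include <condition_variable>
#include <list>
#include <mutex>
#include <string>

std::mutex queue_mutex;
std::condition_variable queue_cv;
std::list<std::string> queue;
bool stop = false;

// Producer: build a batch without the lock, then splice it in. splice() is
// O(1) and moves nodes, so no per-element work happens under the lock
// (point 7). `batch` is left empty afterwards.
void push_batch(std::list<std::string>& batch)
{
    std::lock_guard<std::mutex> lock(queue_mutex);
    queue.splice(queue.end(), batch);
    queue_cv.notify_all();
}

// Shutdown: set stop WHILE HOLDING the mutex (point 5), so the notification
// cannot race with a worker that has checked stop but not yet begun waiting.
void request_stop()
{
    {
        std::lock_guard<std::mutex> lock(queue_mutex);
        stop = true;
    }
    queue_cv.notify_all();
}

// Worker: splice one node out into a local list, again O(1) under the lock.
// Returns false once stop is requested and the queue has drained.
bool pop_one(std::list<std::string>& out)
{
    std::unique_lock<std::mutex> lock(queue_mutex);
    queue_cv.wait(lock, [] { return stop || !queue.empty(); });
    if (queue.empty())
        return false;
    out.splice(out.end(), queue, queue.begin());
    return true;
}
```

Directory scanning (point 6) would call `p.is_regular_file()` before taking any lock, then hand completed batches to `push_batch`.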
answered Oct 27 '22 by Maxim Egorushkin