I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.
The problem I am facing is that 1 process with N worker threads performs worse than 2 processes with N/2 worker threads each. I can see this in iotop: 1 process + N threads uses only around 75% of the disk bandwidth, whereas 2 processes + N/2 threads use the full bandwidth.
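Some notes:
- I use MADV_SEQUENTIAL, but it doesn't seem to change anything (or it just slows things down) if I remove it or change the advice argument.
- What htop shows seems to be the same in both cases.

So my questions are:
- Is there anything about mmap() I am not aware of when used in a multithreaded environment?
- If so, why do 2 processes have better performance?
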
EDIT:
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef WORKERS
#define WORKERS 16
#endif

bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;

// Maps a file read-only. Returns { nullptr, 0 } if the file cannot be opened
// or is not a regular file.
std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
    int fd = open(file_path.data(), O_RDONLY);
    if (fd != -1)
    {
        auto dir_ent = std::filesystem::directory_entry{file_path.data()};
        if (dir_ent.is_regular_file())
        {
            auto size = dir_ent.file_size();
            auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            madvise(data, size, MADV_SEQUENTIAL);
            close(fd);
            return { reinterpret_cast<const std::uint8_t*>(data), size };
        }
        close(fd);
    }
    return { nullptr, 0 };
}

void unmap_file(const std::uint8_t* data, std::size_t size)
{
    munmap((void*)data, size);
}

int main(int argc, char* argv[])
{
    std::deque<std::string> queue;
    std::vector<std::thread> threads;

    // Worker threads: take a path from the queue, map the file, XOR its bytes.
    for (std::size_t i = 0; i < WORKERS; ++i)
    {
        threads.emplace_back(
            [&]() {
                std::string path;
                while (true)
                {
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        while (!stop && queue.empty())
                            queue_cv.wait(lock);
                        if (stop && queue.empty())
                            return;
                        path = queue.front();
                        queue.pop_front();
                    }
                    auto [data, size] = map_file(path);
                    std::uint8_t b = 0;
                    for (auto itr = data; itr < data + size; ++itr)
                        b ^= *itr;
                    unmap_file(data, size);
                    std::cout << (int)b << std::endl;
                }
            }
        );
    }

    // Producer: walk the directory tree and enqueue every regular file.
    for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        if (p.is_regular_file())
        {
            queue.push_back(p.path().native());
            queue_cv.notify_one();
        }
    }

    stop = true;
    queue_cv.notify_all();
    for (auto& t : threads)
        t.join();
    return 0;
}

> Is there anything about mmap() I am not aware of when used in a multithreaded environment?

Yes. mmap() requires significant virtual memory manipulation, effectively single-threading your process in places. Per this post from one Linus Torvalds:

> ... playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.
>
> Downsides to mmap:
>
> - quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
> - page faulting is expensive. That's how the mapping gets populated, and it's quite slow.

Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.
So the virtual memory manipulations that mapping files requires are not only expensive, they also can't really be done in parallel: there's only one pool of physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a single process's virtual address space.
You'd almost certainly get better performance reusing a memory buffer for each file, where each buffer is created once and is large enough to hold any file read into it, then reading from the file using low-level POSIX read() call(s). You might want to experiment with using page-aligned buffers and using direct IO by calling open() with the O_DIRECT flag (Linux-specific) to bypass the page cache, since you apparently never re-read any data and any caching is a waste of memory and CPU cycles.
Reusing the buffer also completely eliminates any munmap() or delete/free().
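As a rough illustration of the read()-based approach, here is a minimal sketch. The helper name xor_file is hypothetical, and it makes one simplification that differs from the suggestion above: instead of sizing the buffer to hold the largest file, it streams each file through a fixed-size, caller-owned buffer in chunks. It uses plain buffered IO; O_DIRECT would additionally need an aligned buffer and aligned read sizes, which is not shown here.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: XOR every byte of a file, streaming it through a
// caller-owned buffer with read() instead of mapping it.
// Returns std::nullopt if the file cannot be opened or read.
std::optional<std::uint8_t> xor_file(const char* path, std::vector<std::uint8_t>& buffer)
{
    assert(!buffer.empty());  // the caller allocates the buffer once, up front

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return std::nullopt;

    std::uint8_t b = 0;
    while (true)
    {
        ssize_t n = read(fd, buffer.data(), buffer.size());
        if (n == 0)
            break;  // end of file
        if (n < 0)
        {
            if (errno == EINTR)
                continue;  // interrupted by a signal, retry
            close(fd);
            return std::nullopt;
        }
        for (ssize_t i = 0; i < n; ++i)
            b ^= buffer[static_cast<std::size_t>(i)];
    }
    close(fd);
    return b;
}
```

Each worker thread would create its buffer once before entering its loop and pass the same buffer to every call, so no per-file mapping, unmapping or allocation takes place.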
You'd have to manage the buffers, though. Perhaps prepopulating a queue with N precreated buffers, and returning a buffer to the queue when done with a file?
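One possible shape for such a buffer queue is sketched below. The BufferPool class and its member names are hypothetical, and it assumes fixed-size std::vector buffers; page-aligned storage for direct IO would need a different allocation strategy.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pool: `count` buffers of `size` bytes are created once and
// then handed out and returned for the lifetime of the program.
class BufferPool
{
public:
    using Buffer = std::vector<std::uint8_t>;

    BufferPool(std::size_t count, std::size_t size)
    {
        assert(count > 0);
        for (std::size_t i = 0; i < count; ++i)
            free_.emplace_back(size);
    }

    // Blocks until a buffer is free, then transfers ownership to the caller.
    Buffer acquire()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !free_.empty(); });
        Buffer buffer = std::move(free_.front());
        free_.pop_front();
        return buffer;
    }

    // Returns a buffer to the pool and wakes one waiting thread.
    void release(Buffer buffer)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(std::move(buffer));
        }
        cv_.notify_one();
    }

    // Number of buffers currently sitting in the pool.
    std::size_t available()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<Buffer> free_;
};
```

Moving a std::vector transfers its heap storage rather than copying it, so acquire() and release() do not allocate. If every worker simply keeps one buffer for its whole lifetime, the pool is unnecessary; it only pays off when there are fewer buffers than workers or when buffers are handed between threads.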
As for your second question:

> If so, why do 2 processes have better performance?

The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.

A few notes:
- Run perf stat -ddd <app> and have a look at the context-switches, cpu-migrations and page-faults numbers.
- The threads likely contend for vm_area_struct in the kernel process structure on mmap and page faults. Try passing the MAP_POPULATE or MAP_LOCKED flag into mmap to minimize page faults. Alternatively, try mmap with the MAP_POPULATE or MAP_LOCKED flag in the main thread only (you may like to ensure that all threads run on the same NUMA node in this case).
- You may also want to experiment with MAP_HUGETLB and one of the MAP_HUGE_2MB, MAP_HUGE_1GB flags.
- Use numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
- Lock the mutex before setting stop = true, otherwise the condition variable notification can get lost and deadlock the waiting thread forever.
- The p.is_regular_file() check doesn't require the mutex to be locked.
- std::deque can be replaced with std::list, using splice to push and pop elements to minimize the time the mutex is locked.
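The queue-related notes (setting stop under the mutex and using std::list with splice) could be combined roughly as follows. This is only a sketch: the WorkQueue class and its member names are hypothetical and not part of the question's code.

```cpp
#include <cassert>
#include <condition_variable>
#include <list>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical replacement for the deque + global `stop` flag in the question.
class WorkQueue
{
public:
    // The list node is allocated before the lock is taken; under the lock,
    // splice() only relinks pointers.
    void push(std::string path)
    {
        std::list<std::string> node;
        node.push_back(std::move(path));
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stop_);  // pushing after close() is a logic error in this sketch
            items_.splice(items_.end(), node);
        }
        cv_.notify_one();
    }

    // Blocks until an item is available; returns std::nullopt once the queue
    // has been closed and drained.
    std::optional<std::string> pop()
    {
        std::list<std::string> node;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return stop_ || !items_.empty(); });
            if (items_.empty())
                return std::nullopt;
            node.splice(node.end(), items_, items_.begin());
        }
        // The string is moved out and the node freed without holding the lock.
        return std::move(node.front());
    }

    // `stop_` is written while holding the mutex, so a worker cannot check the
    // predicate, miss the update, and then sleep through notify_all().
    void close()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::list<std::string> items_;
    bool stop_ = false;
};
```

With this shape the producer would evaluate p.is_regular_file() before calling push(), so the filesystem check happens outside the lock, and each worker would loop on pop() until it returns std::nullopt.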