
Bad Linux Memory Mapped File Performance with Random Access C++ & Python

While trying to use memory mapped files to create a multi-gigabyte file (around 13 GB), I ran into what appears to be a problem with mmap(). The initial implementation was done in C++ on Windows using boost::iostreams::mapped_file_sink and all was well. The code was then run on Linux, and what took minutes on Windows became hours on Linux.

The two machines are clones of the same hardware: Dell R510, 2.4 GHz, 8 MB cache, 16 GB RAM, 1 TB disk, PERC H200 controller.

The Linux machine runs Oracle Enterprise Linux 6.5 with the 3.8 kernel and g++ 4.8.3.

There was some concern that there might be a problem with the boost library, so implementations were also done with boost::interprocess::file_mapping and with native mmap(). All three show the same behavior: Windows and Linux performance is on par up to a certain point, after which the Linux performance falls off badly.

Full source code and performance numbers are linked below.

// C++ code using boost::iostreams
// NOTE: inputStream (defined elsewhere in the linked LoadTest.cpp) is a smart pointer
// to the std::istream holding the binary (index, value) uint32 pairs.
void IostreamsMapping(size_t rowCount)
{
   std::string outputFileName = "IoStreamsMapping.out";
   boost::iostreams::mapped_file_params params(outputFileName);
   params.new_file_size = static_cast<boost::iostreams::stream_offset>(sizeof(uint64_t) * rowCount);
   boost::iostreams::mapped_file_sink fileSink(params); // NOTE: this form of the constructor takes care of creating and sizing the file.
   uint64_t* dest = reinterpret_cast<uint64_t*>(fileSink.data());
   DoMapping(dest, rowCount);
}

void DoMapping(uint64_t* dest, size_t rowCount)
{
   inputStream->seekg(0, std::ios::beg);
   uint32_t index, value;
   for (size_t i = 0; i < rowCount; ++i)
   {
      // Read one (index, value) pair and scatter the value to a random position
      // in the memory-mapped output file.
      inputStream->read(reinterpret_cast<char*>(&index), static_cast<std::streamsize>(sizeof(uint32_t)));
      inputStream->read(reinterpret_cast<char*>(&value), static_cast<std::streamsize>(sizeof(uint32_t)));
      dest[index] = value;
   }
}
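
Besides the boost::iostreams version above, the question notes that the same behavior was reproduced with boost::interprocess::file_mapping and with native mmap(). The full versions are in the linked LoadTest.cpp; the following is only a minimal sketch of what the native mmap() variant looks like (the function name and error handling here are illustrative, not copied from the posted source):

// Native mmap() variant (sketch): create and size the output file, map it
// read/write, and reuse the same DoMapping() routine as above.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <stdexcept>

void NativeMapping(size_t rowCount)
{
   const char* outputFileName = "NativeMapping.out";
   const size_t byteCount = sizeof(uint64_t) * rowCount;

   int fd = open(outputFileName, O_RDWR | O_CREAT | O_TRUNC, 0644);
   if (fd == -1 || ftruncate(fd, static_cast<off_t>(byteCount)) == -1)
      throw std::runtime_error("failed to create/size output file");

   void* addr = mmap(nullptr, byteCount, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
   if (addr == MAP_FAILED)
      throw std::runtime_error("mmap failed");

   DoMapping(static_cast<uint64_t*>(addr), rowCount);

   munmap(addr, byteCount);
   close(fd);
}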

One final test was done in Python to reproduce this in another language. The fall-off happened at the same place, so it looks like the same problem.

# Python code using numpy
import numpy as np
# inputFile, outputFile and count are defined earlier in mmap_test.py (linked below);
# the input holds count (index, value) pairs of uint32, with indices at even offsets.
fpr = np.memmap(inputFile, dtype='uint32', mode='r', shape=(count*2))
out = np.memmap(outputFile, dtype='uint64', mode='w+', shape=(count))
print("writing output")
# Random scatter of 8-byte stores across the output mapping (an exact mirror of
# the C++ dest[index] = value would use fpr[1::2] on the right-hand side).
out[fpr[::2]] = fpr[::2]

For the C++ tests, Windows and Linux have similar performance up to around 300 million int64s (with Linux looking slightly faster). Performance falls off on Linux around 3 GB (400 million * 8 bytes per int64 = 3.2 GB) for both C++ and Python.

I know that on 32-bit Linux 3 GB is a magic boundary, but I am unaware of similar behavior for 64-bit Linux.

The gist of the results is 1.4 minutes for Windows becoming 1.7 hours on Linux at 400 million int64s. I am actually trying to map close to 1.3 billion int64s.

Can anyone explain why there is such a disconnect in performance between Windows and Linux?

Any help or suggestions would be greatly appreciated!

LoadTest.cpp

Makefile

LoadTest.vcxproj

updated mmap_test.py

original mmap_test.py

Updated Results (with updated Python code; Python speed is now comparable with C++)

Original Results (NOTE: the Python results are stale)

Asked by shao.lo on Oct 13 '14

1 Answer

Edit: Upgrading this to a "proper answer". The problem is with the way Linux handles "dirty pages". I still want my system to flush dirty pages now and again, so I didn't allow it to keep TOO many outstanding pages; but at the same time, I can show that this is what is going on.

I did this (with "sudo -i"):

# echo 80 > /proc/sys/vm/dirty_ratio
# echo 60 > /proc/sys/vm/dirty_background_ratio

Which gives these VM dirty settings:

grep ^ /proc/sys/vm/dirty*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:60
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:80
/proc/sys/vm/dirty_writeback_centisecs:500
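
(For reference, and not shown in the original answer: on systems with the standard sysctl tool the same values can be applied like so.)

# sysctl -w vm.dirty_ratio=80
# sysctl -w vm.dirty_background_ratio=60

To make them persistent across reboots, the equivalent lines (vm.dirty_ratio = 80 and vm.dirty_background_ratio = 60) can go in /etc/sysctl.conf.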

This makes my benchmark run like this:

$ ./a.out m64 200000000
Setup Duration 33.1042 seconds
Linux: mmap64
size=1525 MB
Mapping Duration 30.6785 seconds
Overall Duration 91.7038 seconds

Compare with "before":

$ ./a.out m64 200000000
Setup Duration 33.7436 seconds
Linux: mmap64
size=1525
Mapping Duration 1467.49 seconds
Overall Duration 1501.89 seconds

which had these VM dirty settings:

grep ^ /proc/sys/vm/dirty*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

I'm not sure exactly what settings I should use to get IDEAL performance whilst still not leaving all dirty pages sitting around in memory forever (meaning that if the system crashes, it takes much longer to write out to disk).
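
One option not explored in this answer (sketched here only as an assumption about what might help) is to have the writing process itself bound the amount of dirty data, instead of relying solely on the global ratios, by flushing the mapping periodically with msync():

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: scatter rowCount (index, value) pairs into the mapping
// 'dest' (the mmap() base address, mappedBytes long), forcing writeback every
// flushEvery stores so dirty pages never pile up.
void StoreWithPeriodicFlush(uint64_t* dest, size_t mappedBytes,
                            const uint32_t* indices, const uint32_t* values,
                            size_t rowCount, size_t flushEvery)
{
    for (size_t i = 0; i < rowCount; ++i)
    {
        dest[indices[i]] = values[i];
        if ((i + 1) % flushEvery == 0)
            msync(dest, mappedBytes, MS_SYNC);   // block until dirty pages are written back
    }
    msync(dest, mappedBytes, MS_SYNC);           // flush whatever is left
}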

For history: Here's what I originally wrote as a "non-answer" - some comments here still apply...

Not REALLY an answer, but I find it rather interesting that if I change the code to first read the entire array and then write it out, it's SIGNIFICANTLY faster than doing both in the same loop. I appreciate that this is utterly useless if you need to deal with really huge data sets (bigger than memory). With the original code as posted, the time for 100M uint64 values is 134s. When I split the read and the write cycle, it's 43s.

This is the DoMapping function [only code I've changed] after modification:

struct VI
{
    uint32_t index;   // field order matches the {index, value} initialization below
    uint32_t value;
};


void DoMapping(uint64_t* dest, size_t rowCount)
{
   inputStream->seekg(0, std::ios::beg);
   std::chrono::system_clock::time_point startTime = std::chrono::system_clock::now();
   uint32_t index, value;
   std::vector<VI> data;
   // Pass 1: read all (index, value) pairs into memory.
   for(size_t i = 0; i < rowCount; i++)
   {
       inputStream->read(reinterpret_cast<char*>(&index), static_cast<std::streamsize>(sizeof(uint32_t)));
       inputStream->read(reinterpret_cast<char*>(&value), static_cast<std::streamsize>(sizeof(uint32_t)));
       VI d = {index, value};
       data.push_back(d);
   }
   // Pass 2: scatter the values into the memory-mapped output.
   for (size_t i = 0; i < rowCount; ++i)
   {
       value = data[i].value;
       index = data[i].index;
       dest[index] = value;
   }
   std::chrono::duration<double> mappingTime = std::chrono::system_clock::now() - startTime;
   std::cout << "Mapping Duration " << mappingTime.count() << " seconds" << std::endl;
   inputStream.reset();
}

I'm currently running a test with 200M records, which on my machine takes a significant amount of time (2000+ seconds without code-changes). It is very clear that the time taken is from disk-I/O, and I'm seeing IO-rates of 50-70MB/s, which is pretty good, as I don't really expect my rather unsophisticated setup to deliver much more than that. The improvement is not as good with the larger size, but still a decent improvement: 1502s total time, vs 2021s for the "read and write in the same loop".

Also, I'd like to point out that this is a rather terrible test for any system - the fact that Linux is notably worse than Windows is beside the point - you do NOT really want to map a large file and write 8 bytes [meaning the 4KB page has to be read in] to each page at random. If this reflects your REAL application, then you seriously should rethink your approach in some way. It will run fine when you have enough free memory that the whole memory-mapped region fits in RAM.
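
If the data really does have to go out through a mapping like this, one possible "rethink", shown purely as an illustration (it is not something proposed elsewhere in this answer), is to sort the buffered (index, value) pairs by destination index before the second loop, so the stores sweep the mapped pages in order instead of hitting them at random:

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative only: reuses the VI struct from the modified DoMapping above.
// Sorting by destination index turns the random 8-byte stores into a mostly
// sequential pass over the mapped pages.
void SortedWriteBack(uint64_t* dest, std::vector<VI>& data)
{
    std::sort(data.begin(), data.end(),
              [](const VI& a, const VI& b) { return a.index < b.index; });
    for (const VI& d : data)
        dest[d.index] = d.value;
}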

There is plenty of RAM in my system, so I believe that the problem is that Linux doesn't like too many mapped pages that are "dirty".

I have a feeling that this may have something to do with it: https://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages More explanation: http://www.westnet.com/~gsmith/content/linux-pdflush.htm

Unfortunately, I'm also very tired and need to sleep. I'll see if I can experiment with these tomorrow - but don't hold your breath. Like I said, this is not REALLY an answer, but rather a long comment that doesn't really fit in a comment (and contains code, which is completely rubbish to read in a comment).

Answered by Mats Petersson on Oct 29 '22