
Reading a huge file into a C++ vector in Ubuntu (linux OS)

In my C++ program running on Linux (Ubuntu 14.04), I need to read a 90 GB file completely buffered into a C++ vector, and I have only 125 GB of memory.

When I read the file chunk by chunk, the cached memory usage in Linux keeps growing, eventually taking more than 50% of the 125 GB, and free memory quickly drops below 50 GB:

              total        used        free      shared  buff/cache   available
Mem:            125          60           0           0          65          65
Swap:           255           0         255

Eventually the free memory reaches zero and the file-reading process nearly stalls, so I have to manually run:

echo 3 | sudo tee /proc/sys/vm/drop_caches

to clear the cached memory so that the read can resume. I understand that the cache exists to speed up re-reading a file. My question is: how can I avoid manually running the drop-caches command and still ensure the file read completes successfully?

Asked Jul 07 '17 by Ren Chen



1 Answer

Since you are simply streaming the data and never rereading it, the page cache does you no good whatsoever. In fact, given the amount of data you're pushing through the page cache, and the memory pressure from your application, otherwise useful data is likely evicted from the page cache and your system performance suffers because of that.

So don't use the cache when reading your data. Use direct IO. Per the Linux open() man page:

O_DIRECT (since Linux 2.4.10)

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

...

NOTES

...

O_DIRECT

The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by filesystem and kernel version and might be absent entirely. However there is currently no filesystem-independent interface for an application to discover these restrictions for a given file or filesystem. Some filesystems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).

Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the filesystem. Since Linux 2.6.0, alignment to the logical block size of the underlying storage (typically 512 bytes) suffices. The logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:

      blockdev --getss

...

Since you are not reading the data over and over, direct IO is likely to improve performance somewhat, as the data will go directly from disk into your application's memory instead of from disk, to the page cache, and then into your application's memory.

Use low-level, C-style I/O with open()/read()/close(), and open the file with the O_DIRECT flag:

int fd = ::open( filename, O_RDONLY | O_DIRECT );

This will result in the data being read directly into the application's memory, without being cached in the system's page cache.

You'll have to read() using aligned memory, so you'll need something like this to actually read the data:

char *buffer;
size_t pageSize = sysconf( _SC_PAGESIZE );   /* typically 4096 */
size_t bufferSize = 32UL * pageSize;

int rc = ::posix_memalign( ( void ** ) &buffer, pageSize, bufferSize );
if ( rc != 0 ) { /* allocation failed - handle the error */ }

posix_memalign() is a POSIX-standard function that returns a pointer to memory aligned as requested. Page-aligned buffers are usually more than sufficient, but aligning to hugepage size (2MiB on x86-64) will hint the kernel that you want transparent hugepages for that allocation, making access to your buffer more efficient when you read it later.

ssize_t bytesRead = ::read( fd, buffer, bufferSize );

Without your code, I can't say how to get the data from buffer into your std::vector, but it shouldn't be hard. There are likely ways to wrap the C-style low-level file descriptor with a C++ stream of some type, and to configure that stream to use memory properly aligned for direct IO.

If you want to see the difference, try this:

echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file of=/dev/null bs=32k

Time that. Then look at the amount of data in the page cache.

Then do this:

echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file iflag=direct of=/dev/null bs=32k

Check the amount of data in the page cache after that...
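One way to "look at the amount of data in the page cache" is the Cached field of /proc/meminfo (or the buff/cache column of free). A small sketch, using a scratch file in the current directory rather than your real data file:

```shell
#!/bin/sh
# Snapshot the page-cache size (in kB) before and after a buffered read.
grep -E '^Cached:' /proc/meminfo

dd if=/dev/zero of=./cache_probe.bin bs=1M count=32 status=none
cat ./cache_probe.bin > /dev/null      # buffered read populates the page cache

grep -E '^Cached:' /proc/meminfo       # Cached should have grown by ~32 MB
rm -f ./cache_probe.bin
```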

You can experiment with different block sizes to see what works best on your hardware and filesystem.
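A sketch of that experiment, sweeping a few block sizes over a small generated file; substitute your real file and, for meaningful absolute numbers, drop the caches (root only) between runs — without that, the buffered timings mostly measure cache hits, so treat them as relative:

```shell
#!/bin/sh
# Compare buffered vs direct reads at several block sizes.
FILE=./dd_test_file.bin
dd if=/dev/zero of="$FILE" bs=1M count=64 status=none

for BS in 32k 128k 1M; do
    echo "block size $BS (buffered):"
    dd if="$FILE" of=/dev/null bs="$BS" 2>&1 | tail -n 1
    echo "block size $BS (direct):"
    # On filesystems without O_DIRECT support (e.g. tmpfs) this prints an error.
    dd if="$FILE" iflag=direct of=/dev/null bs="$BS" 2>&1 | tail -n 1
done

rm -f "$FILE"
```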

Note well, though, that direct IO is very implementation-dependent. Requirements to perform direct IO can vary significantly between different filesystems, and performance can vary drastically depending on your IO pattern and your specific hardware. Most of the time it's not worth those dependencies, but the one simple use where it usually is worthwhile is streaming a huge file without rereading/rewriting any part of the data.

Answered Oct 17 '22 by Andrew Henle