
What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run some statistics on a 10 GB text file, e.g. a FIX engine log. We are on 32-bit Linux (Intel, 4 CPUs) and coding in Perl, but the language doesn't really matter.

I have found some interesting tips in Tim Bray's WideFinder project. However, we've found that memory mapping is inherently limited by the 32-bit architecture.

We tried using multiple processes, which works faster if we process the file in parallel using 4 processes on 4 CPUs. Adding multi-threading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.
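For illustration, this is roughly the multi-process split we are experimenting with (simplified; the file name and the FIX pattern are placeholders, and a real version would return the counts through pipes instead of printing them):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Split the file into one byte range per CPU, fork a worker per range,
    # and let each worker count matches in the lines that start in its range.
    my $file    = 'engine.log';                 # placeholder path
    my $workers = 4;                            # one per CPU
    my $size    = -s $file or die "cannot stat $file";
    my $chunk   = int($size / $workers) + 1;

    my @pids;
    for my $i (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                        # child
            open my $fh, '<', $file or die "open: $!";
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            if ($start > 0) {
                seek $fh, $start - 1, 0;
                <$fh>;                          # skip to the next line boundary
            }
            my $count = 0;
            while (tell($fh) < $end and defined(my $line = <$fh>)) {
                $count++ if $line =~ /35=D/;    # e.g. FIX NewOrderSingle
            }
            print "worker $i: $count\n";        # real code: send back via a pipe
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;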

The memory-mapping part is not very stable: sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe because of page faults or something related to virtual memory usage. In any case, mmap cannot scale beyond 4 GB on a 32-bit architecture.

We tried Perl's IPC::Mmap and Sys::Mmap. We also looked into MapReduce, but the problem is really I/O bound; the processing itself is fast enough.

So we decided to try optimizing the basic I/O by tuning the buffer size, type, etc.
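The kind of block-read loop we are benchmarking looks roughly like this (the 4 MB block size is just a starting point, and the pattern is again a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read in large blocks with sysread instead of line-by-line I/O.
    # A partial last line is carried over so a match can never be split
    # across a block boundary.
    my $file  = 'engine.log';                   # placeholder path
    my $block = 4 * 1024 * 1024;                # block size to tune

    open my $fh, '<', $file or die "open: $!";
    binmode $fh;

    my ($buf, $tail, $count) = ('', '', 0);
    while (sysread($fh, $buf, $block)) {
        $buf = $tail . $buf;                    # re-attach carried-over bytes
        my $nl = rindex($buf, "\n");
        if ($nl >= 0) {
            $tail = substr($buf, $nl + 1);      # keep the trailing partial line
            $buf  = substr($buf, 0, $nl + 1);
        } else {
            ($tail, $buf) = ($buf, '');         # no newline in this block
        }
        $count += () = $buf =~ /35=D/g;         # count matches in complete lines
    }
    $count += () = $tail =~ /35=D/g if length $tail;
    print "matches: $count\n";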

Can anyone point to an existing project where this problem was solved efficiently, in any language or on any platform, or at least suggest a direction?

asked Aug 28 '09 by alex



1 Answer

Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that the I/O outpaces what a single CPU can process, don't waste your time on anything more. Anyway, you should ask: why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? That would be an order of magnitude more worthwhile. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
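To be concrete, a plain loop along these lines (the message pattern and file name are only for illustration) will usually keep a single disk busy on its own:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Plain single-threaded Perl I/O: read line by line and tally message types.
    my %count;
    open my $fh, '<', 'engine.log' or die "open: $!";   # placeholder path
    while (my $line = <$fh>) {
        # e.g. pull the FIX MsgType (tag 35) out of each line
        $count{$1}++ if $line =~ /\b35=([^\x01|\s]+)/;
    }
    close $fh;

    printf "%-10s %d\n", $_, $count{$_} for sort keys %count;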

Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
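To get repeatable cold-cache numbers on Linux you can drop the page cache between runs (needs root). A hypothetical timing wrapper, just as a sketch:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Drop the Linux page cache so every run measures cold-cache I/O.
    # Writing "3" to drop_caches frees the page cache plus dentries and inodes.
    sub drop_caches {
        system('sync');
        open my $fh, '>', '/proc/sys/vm/drop_caches' or die "need root: $!";
        print {$fh} "3\n";
        close $fh;
    }

    # Time one full pass over the file with a given reader subroutine.
    sub bench {
        my ($label, $reader) = @_;
        drop_caches();
        my $t0 = [gettimeofday];
        $reader->();
        printf "%-20s %.2f s\n", $label, tv_interval($t0);
    }

    # usage, with hypothetical reader subroutines:
    # bench('line-by-line', \&read_lines);
    # bench('4 MB sysread', \&read_blocks);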

answered Oct 07 '22 by Hynek -Pichi- Vychodil