
What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run some statistics on a 10 GB text file, e.g. a FIX engine log. We are on 32-bit Linux (Intel, 4 CPUs) and coding in Perl, but the language doesn't really matter.

I have found some interesting tips in Tim Bray's WideFinder project. However, we've found that memory mapping is inherently limited by the 32-bit architecture.

We tried using multiple processes, which works faster if we process the file in parallel using 4 processes on 4 CPUs. Adding multi-threading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.
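For illustration, this is roughly the multi-process split we are experimenting with (simplified; the file name and the FIX pattern are placeholders, and a real version would return the counts through pipes instead of printing them):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Split the file into one byte range per CPU, fork a worker per range,
    # and let each worker count matches in the lines that start in its range.
    my $file    = 'engine.log';                 # placeholder path
    my $workers = 4;                            # one per CPU
    my $size    = -s $file or die "cannot stat $file";
    my $chunk   = int($size / $workers) + 1;

    my @pids;
    for my $i (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                        # child
            open my $fh, '<', $file or die "open: $!";
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            if ($start > 0) {
                seek $fh, $start - 1, 0;
                <$fh>;                          # skip to the next line boundary
            }
            my $count = 0;
            while (tell($fh) < $end and defined(my $line = <$fh>)) {
                $count++ if $line =~ /35=D/;    # e.g. FIX NewOrderSingle
            }
            print "worker $i: $count\n";        # real code: send back via a pipe
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;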

The memory-mapping part is not very stable: sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe because of page faults or something related to virtual memory usage. In any case, mmap cannot scale beyond 4 GB on a 32-bit architecture.

We tried Perl's IPC::Mmap and Sys::Mmap. We also looked into MapReduce, but the problem is really I/O bound; the processing itself is fast enough.

So we decided to try optimizing the basic I/O by tuning the buffer size, type, etc.
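The kind of block-read loop we are benchmarking looks roughly like this (the 4 MB block size is just a starting point, and the pattern is again a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read in large blocks with sysread instead of line-by-line I/O.
    # A partial last line is carried over so a match can never be split
    # across a block boundary.
    my $file  = 'engine.log';                   # placeholder path
    my $block = 4 * 1024 * 1024;                # block size to tune

    open my $fh, '<', $file or die "open: $!";
    binmode $fh;

    my ($buf, $tail, $count) = ('', '', 0);
    while (sysread($fh, $buf, $block)) {
        $buf = $tail . $buf;                    # re-attach carried-over bytes
        my $nl = rindex($buf, "\n");
        if ($nl >= 0) {
            $tail = substr($buf, $nl + 1);      # keep the trailing partial line
            $buf  = substr($buf, 0, $nl + 1);
        } else {
            ($tail, $buf) = ($buf, '');         # no newline in this block
        }
        $count += () = $buf =~ /35=D/g;         # count matches in complete lines
    }
    $count += () = $tail =~ /35=D/g if length $tail;
    print "matches: $count\n";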

Can anyone point to an existing project where this problem was solved efficiently, in any language or on any platform, or at least suggest a direction?

asked Aug 28 '09 by alex



1 Answer

Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that the I/O outpaces what a single CPU can process, don't waste your time on anything more. Anyway, you should ask: why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? That would be an order of magnitude more worthwhile. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
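To be concrete, a plain loop along these lines (the message pattern and file name are only for illustration) will usually keep a single disk busy on its own:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Plain single-threaded Perl I/O: read line by line and tally message types.
    my %count;
    open my $fh, '<', 'engine.log' or die "open: $!";   # placeholder path
    while (my $line = <$fh>) {
        # e.g. pull the FIX MsgType (tag 35) out of each line
        $count{$1}++ if $line =~ /\b35=([^\x01|\s]+)/;
    }
    close $fh;

    printf "%-10s %d\n", $_, $count{$_} for sort keys %count;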

Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
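To get repeatable cold-cache numbers on Linux you can drop the page cache between runs (needs root). A hypothetical timing wrapper, just as a sketch:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Drop the Linux page cache so every run measures cold-cache I/O.
    # Writing "3" to drop_caches frees the page cache plus dentries and inodes.
    sub drop_caches {
        system('sync');
        open my $fh, '>', '/proc/sys/vm/drop_caches' or die "need root: $!";
        print {$fh} "3\n";
        close $fh;
    }

    # Time one full pass over the file with a given reader subroutine.
    sub bench {
        my ($label, $reader) = @_;
        drop_caches();
        my $t0 = [gettimeofday];
        $reader->();
        printf "%-20s %.2f s\n", $label, tv_interval($t0);
    }

    # usage, with hypothetical reader subroutines:
    # bench('line-by-line', \&read_lines);
    # bench('4 MB sysread', \&read_blocks);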

answered Oct 07 '22 by Hynek -Pichi- Vychodil