
How to directly and efficiently access a very large text file?

I have a very large text file (10+ GB) that I want to read for some data mining techniques. To do that, I use parallel techniques with MPI so that many processes can access the same file together.
In fact, I want each process to read N lines. Since the file is not structured (each line has the same number of fields, but each field can contain a different number of characters), I'm forced to parse the file, and that is not parallel and takes a lot of time. Is there any way to jump directly to a specific line number without parsing and counting the lines? Thank you for your help.

asked Apr 30 '12 by ezzakrem




2 Answers

If your file isn't otherwise indexed, there is no direct way.

Indexing it might be worth it (scan it once to find all the line endings, and store the offsets of each line or chunk of lines). If you need to process the file multiple times, and it does not change, the cost of indexing it could be offset by the ease of using the index for further runs.
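For instance, here is a minimal sketch of that indexing pass in C (the function name and the on-disk format are just placeholders): scan the data once, record the byte offset at which each line starts, and write the offsets as fixed-width 64-bit integers so later runs can seek straight into the index.

```c
#include <stdio.h>

/* One-time indexing pass (error handling kept minimal): writes one 64-bit
 * offset per line, giving the byte position where that line starts in the
 * data file. If the file ends with '\n', the final entry equals the file
 * size and can serve as an end marker. */
int build_line_index(const char *data_path, const char *index_path)
{
    FILE *in  = fopen(data_path, "rb");
    FILE *out = fopen(index_path, "wb");
    if (!in || !out)
        return -1;

    long long line_start = 0;   /* the first line starts at offset 0 */
    long long offset = 0;
    int c;

    fwrite(&line_start, sizeof line_start, 1, out);
    while ((c = getc(in)) != EOF) {
        offset++;
        if (c == '\n') {
            line_start = offset;   /* next line begins right after '\n' */
            fwrite(&line_start, sizeof line_start, 1, out);
        }
    }

    fclose(in);
    fclose(out);
    return 0;
}
```

With fixed-width entries, a worker that wants line k just reads the k-th 64-bit value from the index and seeks the data file to that position.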

Otherwise, if you don't need all the jobs to have exactly the same number of lines/items, you could just fudge it.
Seek to a given offset (say 1G), and look for the closest line separator. Repeat at offset 2G, etc. until you've found enough break points.

You can then fire off your parallel tasks on each of the chunks you've identified.
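A minimal sketch of that chunking scheme with MPI in C (process_line() is a placeholder for the actual work): each rank takes roughly file_size/nprocs bytes, snaps its start to the next line boundary, and reads whole lines until it passes its end offset, so no line is split or dropped.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64
#include <mpi.h>
#include <stdio.h>
#include <sys/types.h>

static void process_line(const char *line) { (void)line; /* real work here */ }

/* Usage sketch: mpirun -np N ./reader bigfile.txt */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    FILE *f = fopen(argv[1], "rb");
    fseeko(f, 0, SEEK_END);
    off_t size = ftello(f);

    /* This rank owns the lines that *start* inside [start, end). */
    off_t start = size * rank / nprocs;
    off_t end   = size * (rank + 1) / nprocs;

    if (rank == 0) {
        fseeko(f, 0, SEEK_SET);
    } else {
        /* Start one byte early and skip past the next '\n': this lands on
         * the first line that begins at or after `start`. */
        fseeko(f, start - 1, SEEK_SET);
        int c;
        while ((c = getc(f)) != EOF && c != '\n')
            ;
    }

    /* Read whole lines until we have passed our end boundary; the last line
     * that starts before `end` is finished here, not by the next rank. */
    char line[1 << 16];
    while (ftello(f) < end && fgets(line, sizeof line, f))
        process_line(line);

    fclose(f);
    MPI_Finalize();
    return 0;
}
```

The boundary snapping is what lets every rank work independently: the only shared knowledge they need is the file size and their own rank.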

answered Oct 10 '22 by Mat


A few other options beyond what has been mentioned here that will not require scanning the whole file:

  1. Make a master process that pushes lines via pipes/FIFOs to child processes that do the actual processing. This might be a bit slower, but if, say, 90% of the time in the subprocesses is spent on the actual text crunching, it should be OK.

  2. A stupid but effective trick: say you have N processes, and you can tell each process its "serial number" via argv or similar, e.g. processor -serial_number [0|1|...|N-1] -num_procs N. They can all read the same data, but each one processes only the lines where lineno % num_procs == serial_number. It's a bit less efficient because every process reads the entire file, but again, if they only work on every Nth line and that is what consumes most of the time, you should be fine. A sketch follows below.
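A minimal sketch of the second trick in C (the flag names and the process_line() helper are made up for illustration; serial numbers run 0..N-1 so the modulo test works directly): every instance streams the same input but only touches its own slice of the lines.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void process_line(const char *line) { (void)line; /* real work here */ }

/* Usage: ./processor -serial_number K -num_procs N < bigfile.txt
 * Every instance reads all the data, but handles only the lines where
 * lineno % num_procs == serial_number. Lines longer than the buffer
 * would need extra handling. */
int main(int argc, char **argv)
{
    long serial = 0, num_procs = 1;
    for (int i = 1; i + 1 < argc; i += 2) {
        if (strcmp(argv[i], "-serial_number") == 0)
            serial = atol(argv[i + 1]);
        else if (strcmp(argv[i], "-num_procs") == 0)
            num_procs = atol(argv[i + 1]);
    }

    char line[1 << 16];
    long lineno = 0;
    while (fgets(line, sizeof line, stdin)) {
        if (lineno % num_procs == serial)
            process_line(line);
        lineno++;
    }
    return 0;
}
```

Launching N copies with serial numbers 0 through N-1 covers every line exactly once, at the cost of each copy reading the whole file.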

answered Oct 10 '22 by Not_a_Golfer