
How to directly and efficiently access a very large text file?

I have a very large text file (10+ GB) that I want to read for some data mining techniques. To do that, I use parallel techniques with MPI so that many processes can access the same file together.
In fact, I want each process to read N lines. Since the file is not structured (each line has the same number of fields, but each field can contain a different number of characters), I'm forced to parse the file, and that is not parallel and takes a lot of time. Is there any way to jump directly to a specific line number without parsing and counting the lines? Thank you for your help.

asked Apr 30 '12 by ezzakrem




2 Answers

If your file isn't otherwise indexed, there is no direct way.

Indexing it might be worth it (scan it once to find all the line endings, and store the offsets of each line or chunk of lines). If you need to process the file multiple times, and it does not change, the cost of indexing it could be offset by the ease of using the index for further runs.
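For instance, here is a minimal sketch of that indexing pass in C (the function name and the on-disk format are just placeholders): scan the data once, record the byte offset at which each line starts, and write the offsets as fixed-width 64-bit integers so later runs can seek straight into the index.

```c
#include <stdio.h>

/* One-time indexing pass (error handling kept minimal): writes one 64-bit
 * offset per line, giving the byte position where that line starts in the
 * data file. If the file ends with '\n', the final entry equals the file
 * size and can serve as an end marker. */
int build_line_index(const char *data_path, const char *index_path)
{
    FILE *in  = fopen(data_path, "rb");
    FILE *out = fopen(index_path, "wb");
    if (!in || !out)
        return -1;

    long long line_start = 0;   /* the first line starts at offset 0 */
    long long offset = 0;
    int c;

    fwrite(&line_start, sizeof line_start, 1, out);
    while ((c = getc(in)) != EOF) {
        offset++;
        if (c == '\n') {
            line_start = offset;   /* next line begins right after '\n' */
            fwrite(&line_start, sizeof line_start, 1, out);
        }
    }

    fclose(in);
    fclose(out);
    return 0;
}
```

With fixed-width entries, a worker that wants line k just reads the k-th 64-bit value from the index and seeks the data file to that position.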

Otherwise, if you don't need all the jobs to have exactly the same number of lines/items, you could just fudge it.
Seek to a given offset (say 1G), and look for the closest line separator. Repeat at offset 2G, etc. until you've found enough break points.

You can then fire off your parallel tasks on each of the chunks you've identified.
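A minimal sketch of that chunking scheme with MPI in C (process_line() is a placeholder for the actual work): each rank takes roughly file_size/nprocs bytes, snaps its start to the next line boundary, and reads whole lines until it passes its end offset, so no line is split or dropped.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64
#include <mpi.h>
#include <stdio.h>
#include <sys/types.h>

static void process_line(const char *line) { (void)line; /* real work here */ }

/* Usage sketch: mpirun -np N ./reader bigfile.txt */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    FILE *f = fopen(argv[1], "rb");
    fseeko(f, 0, SEEK_END);
    off_t size = ftello(f);

    /* This rank owns the lines that *start* inside [start, end). */
    off_t start = size * rank / nprocs;
    off_t end   = size * (rank + 1) / nprocs;

    if (rank == 0) {
        fseeko(f, 0, SEEK_SET);
    } else {
        /* Start one byte early and skip past the next '\n': this lands on
         * the first line that begins at or after `start`. */
        fseeko(f, start - 1, SEEK_SET);
        int c;
        while ((c = getc(f)) != EOF && c != '\n')
            ;
    }

    /* Read whole lines until we have passed our end boundary; the last line
     * that starts before `end` is finished here, not by the next rank. */
    char line[1 << 16];
    while (ftello(f) < end && fgets(line, sizeof line, f))
        process_line(line);

    fclose(f);
    MPI_Finalize();
    return 0;
}
```

The boundary snapping is what lets every rank work independently: the only shared knowledge they need is the file size and their own rank.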

answered Oct 10 '22 by Mat


A few other options beyond what has been mentioned here that will not require scanning the whole file:

  1. Make a master process that pushes lines via pipes/FIFOs to child processes that do the actual processing. This might be a bit slower, but if, say, 90% of the time in the subprocesses is spent on the actual text crunching, it should be OK.

  2. A stupid but effective trick: say you have N processes, and you can tell each process its "serial number" via argv or similar, e.g. processor -serial_number [0|1|...|N-1] -num_procs N. They can all read the same data, but each one processes only the lines where lineno % num_procs == serial_number. It's a bit less efficient because every process reads the entire file, but again, if they only work on every Nth line and that is what consumes most of the time, you should be fine. A sketch follows below.
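A minimal sketch of the second trick in C (the flag names and the process_line() helper are made up for illustration; serial numbers run 0..N-1 so the modulo test works directly): every instance streams the same input but only touches its own slice of the lines.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void process_line(const char *line) { (void)line; /* real work here */ }

/* Usage: ./processor -serial_number K -num_procs N < bigfile.txt
 * Every instance reads all the data, but handles only the lines where
 * lineno % num_procs == serial_number. Lines longer than the buffer
 * would need extra handling. */
int main(int argc, char **argv)
{
    long serial = 0, num_procs = 1;
    for (int i = 1; i + 1 < argc; i += 2) {
        if (strcmp(argv[i], "-serial_number") == 0)
            serial = atol(argv[i + 1]);
        else if (strcmp(argv[i], "-num_procs") == 0)
            num_procs = atol(argv[i + 1]);
    }

    char line[1 << 16];
    long lineno = 0;
    while (fgets(line, sizeof line, stdin)) {
        if (lineno % num_procs == serial)
            process_line(line);
        lineno++;
    }
    return 0;
}
```

Launching N copies with serial numbers 0 through N-1 covers every line exactly once, at the cost of each copy reading the whole file.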

answered Oct 10 '22 by Not_a_Golfer