process a large file via multithreading

There is quite a large file (>10 GB) on disk; each line in the file is composed of a line number and a person's name, like this:

1 Jane
2 Perk
3 Sime
4 Perk
.. ..

I have to read this large file, count the frequency of each name, and finally output the results in descending order of frequency, like this:

Perk 2
Jane 1
Sime 1

As the interviewer requested, the job should be done as efficiently as possible, and multithreading is allowed. My solution is something like this:

  1. Because the file is so large, I partition it into several small chunks of about 100 MB each; via lseek I can locate the beginning and the end of each chunk (beg, end);

  2. For these chunks, there is a shared hash-map that uses the person's name as the key and the number of times the name has appeared so far as the value;

  3. For each chunk, a single thread goes through it; every time the thread encounters a person's name, it increments the corresponding value in the shared hash-map;

  4. When all threads finish, I think it's time to sort the hash-map according to the value field (a rough code sketch of this plan follows the list).
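
In code, my plan looks roughly like this (a minimal C++ sketch of steps 1 through 3; names.txt and the 100 MB chunk size are placeholders, and each thread skips the partial line at the start of its byte range so every line is counted exactly once):

    #include <algorithm>
    #include <fstream>
    #include <mutex>
    #include <sstream>
    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    std::unordered_map<std::string, long> g_counts;  // shared hash-map (step 2)
    std::mutex g_mutex;

    // Each thread scans one [beg, end) byte range of the file (step 3).
    void count_chunk(const char* path, std::streamoff beg, std::streamoff end) {
        std::ifstream in(path);
        in.seekg(beg);
        std::string line;
        // Skip the partial line at the front; the previous chunk's thread
        // reads across its own end boundary and picks that line up.
        if (beg != 0) std::getline(in, line);
        while (in.tellg() < end && std::getline(in, line)) {
            std::istringstream ss(line);
            long lineno;
            std::string name;
            if (ss >> lineno >> name) {
                std::lock_guard<std::mutex> lock(g_mutex);
                ++g_counts[name];  // update the shared map under the lock
            }
        }
    }

    int main() {
        const char* path = "names.txt";          // placeholder input file
        const std::streamoff chunk = 100 << 20;  // ~100 MB per chunk (step 1)
        std::ifstream in(path, std::ios::ate);   // open positioned at the end
        const std::streamoff size = in.tellg();  // file size in bytes
        std::vector<std::thread> workers;
        for (std::streamoff beg = 0; beg < size; beg += chunk)
            workers.emplace_back(count_chunk, path, beg,
                                 std::min(beg + chunk, size));
        for (auto& t : workers) t.join();
        // Step 4: sorting g_counts comes next -- the part I am unsure about.
    }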

But because there might be a huge number of distinct names in the file, the sorting could be slow. I haven't come up with a good idea for how to output the names in descending order.

I hope someone can help me with the above problem and suggest a better way to do the job via multithreading, as well as how to handle the sorting.

Alcott asked Jul 19 '12

1 Answer

Using a map-reduce approach could be a good idea for your problem. That approach would consist of two steps:

  1. Map: read chunks of data from the file and create a thread to process each chunk
  2. Reduce: the main thread waits for all other threads to finish and then combines the results from each individual thread.

The advantage of this solution is that you would not need locking between the threads, since each of them would operate on a different chunk of data (see the sketch below). Using a shared data structure, as you are proposing, could work too, but you may incur some overhead due to lock contention.
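
Here is a minimal sketch of that lock-free scheme, assuming the chunks have already been read into memory (the chunking itself would work as in the question; count_chunk and reduce are illustrative names, not a fixed API):

    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    using Counts = std::unordered_map<std::string, long>;

    // Map step: each thread counts names from its own chunk of lines into a
    // private map, so no locking is needed.
    Counts count_chunk(const std::vector<std::string>& lines) {
        Counts local;
        for (const auto& line : lines) {
            auto pos = line.find(' ');  // lines look like "42 Jane"
            if (pos != std::string::npos)
                ++local[line.substr(pos + 1)];
        }
        return local;
    }

    // Reduce step: the main thread merges the per-thread maps into one.
    Counts reduce(const std::vector<Counts>& partials) {
        Counts total;
        for (const auto& partial : partials)
            for (const auto& kv : partial)
                total[kv.first] += kv.second;
        return total;
    }

    int main() {
        std::vector<std::vector<std::string>> chunks;  // filled by the file reader (omitted)
        std::vector<Counts> partials(chunks.size());
        std::vector<std::thread> workers;
        for (size_t i = 0; i < chunks.size(); ++i)
            workers.emplace_back(
                [&, i] { partials[i] = count_chunk(chunks[i]); });
        for (auto& t : workers) t.join();
        Counts total = reduce(partials);  // no thread touches another's map
    }

Since every thread writes only to its own private map, the map step needs no synchronization beyond the final join.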

You need to do the sorting part at the reduce step, when the data from all the threads is available. But you might want to do some work during the map step, so that it is easier (quicker) to finish the complete sort at the reduce step.
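
For instance, the final sort at the reduce step could be as simple as copying the merged map into a vector and sorting it by count (a sketch, where total is the merged map from the previous snippet):

    #include <algorithm>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Turn the merged map into a vector and sort it by count, highest first.
    std::vector<std::pair<std::string, long>>
    sort_by_frequency(const std::unordered_map<std::string, long>& total) {
        std::vector<std::pair<std::string, long>> result(total.begin(),
                                                         total.end());
        std::sort(result.begin(), result.end(),
                  [](const std::pair<std::string, long>& a,
                     const std::pair<std::string, long>& b) {
                      return a.second > b.second;  // descending frequency
                  });
        return result;
    }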

If you prefer to avoid a sequential sort at the end, you could use a custom data structure. I would use a map (something like a red-black tree or a hash table) for quickly finding a name, and a heap to keep the names ordered by frequency. Of course, you would need parallel versions of those data structures. Depending on how coarse-grained the parallelization is, you may or may not run into lock contention.
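
To illustrate the heap part, here is a single-threaded sketch (the parallel versions mentioned above would need concurrent counterparts of these containers):

    #include <iostream>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Print names in descending order of frequency using a max-heap.
    // std::pair compares by its first element, so (count, name) pairs
    // come off the heap highest count first.
    void print_descending(const std::unordered_map<std::string, long>& counts) {
        std::priority_queue<std::pair<long, std::string>> heap;
        for (const auto& kv : counts)
            heap.push({kv.second, kv.first});
        while (!heap.empty()) {
            std::cout << heap.top().second << ' ' << heap.top().first << '\n';
            heap.pop();
        }
    }

Building the heap is O(n) and extracting every element is O(n log n), so asymptotically this matches a plain sort; the heap pays off mainly if you maintain it incrementally during counting or only need the most frequent names.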

betabandido answered Sep 27 '22