process a large file via multithreading

There is quite a large file (>10 GB) on disk; each line in the file is composed of a line number and a person's name, like this:

1 Jane
2 Perk
3 Sime
4 Perk
.. ..

I have to read this large file, count the frequency of each name, and finally output the results in descending order of frequency, like this:

Perk 2
Jane 1
Sime 1

As the interviewer requested, the job should be done as efficiently as possible, and multithreading is allowed. My solution is something like this:

  1. Because the file is so large, I partition it into several small chunks of about 100 MB each; via lseek I can locate the beginning and the end of each chunk (beg, end);

  2. For these chunks, there is a shared hash-map that uses the person's name as the key and the number of times the name has appeared so far as the value;

  3. For each chunk, a single thread goes through it; every time the thread encounters a person's name, it increments the corresponding value in the shared hash-map;

  4. When all threads finish, I think it's time to sort the hash-map according to the value field (a rough code sketch of this plan follows the list).
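
In code, my plan looks roughly like this (a minimal C++ sketch of steps 1 through 3; names.txt and the 100 MB chunk size are placeholders, and each thread skips the partial line at the start of its byte range so every line is counted exactly once):

    #include <algorithm>
    #include <fstream>
    #include <mutex>
    #include <sstream>
    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    std::unordered_map<std::string, long> g_counts;  // shared hash-map (step 2)
    std::mutex g_mutex;

    // Each thread scans one [beg, end) byte range of the file (step 3).
    void count_chunk(const char* path, std::streamoff beg, std::streamoff end) {
        std::ifstream in(path);
        in.seekg(beg);
        std::string line;
        // Skip the partial line at the front; the previous chunk's thread
        // reads across its own end boundary and picks that line up.
        if (beg != 0) std::getline(in, line);
        while (in.tellg() < end && std::getline(in, line)) {
            std::istringstream ss(line);
            long lineno;
            std::string name;
            if (ss >> lineno >> name) {
                std::lock_guard<std::mutex> lock(g_mutex);
                ++g_counts[name];  // update the shared map under the lock
            }
        }
    }

    int main() {
        const char* path = "names.txt";          // placeholder input file
        const std::streamoff chunk = 100 << 20;  // ~100 MB per chunk (step 1)
        std::ifstream in(path, std::ios::ate);   // open positioned at the end
        const std::streamoff size = in.tellg();  // file size in bytes
        std::vector<std::thread> workers;
        for (std::streamoff beg = 0; beg < size; beg += chunk)
            workers.emplace_back(count_chunk, path, beg,
                                 std::min(beg + chunk, size));
        for (auto& t : workers) t.join();
        // Step 4: sorting g_counts comes next -- the part I am unsure about.
    }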

But because there might be a huge number of distinct names in the file, the sorting could be slow. I haven't come up with a good idea for how to output the names in descending order.

I hope someone can help me with the above problem and suggest a better way to do the job via multithreading, as well as how to handle the sorting.

Alcott asked Jul 19 '12

1 Answer

Using a map-reduce approach could be a good idea for your problem. That approach would consist of two steps:

  1. Map: read chunks of data from the file and create a thread to process each chunk
  2. Reduce: the main thread waits for all other threads to finish and then combines the results from each individual thread.

The advantage of this solution is that you would not need locking between the threads, since each of them would operate on a different chunk of data (see the sketch below). Using a shared data structure, as you are proposing, could work too, but you may incur some overhead due to lock contention.
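
Here is a minimal sketch of that lock-free scheme, assuming the chunks have already been read into memory (the chunking itself would work as in the question; count_chunk and reduce are illustrative names, not a fixed API):

    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    using Counts = std::unordered_map<std::string, long>;

    // Map step: each thread counts names from its own chunk of lines into a
    // private map, so no locking is needed.
    Counts count_chunk(const std::vector<std::string>& lines) {
        Counts local;
        for (const auto& line : lines) {
            auto pos = line.find(' ');  // lines look like "42 Jane"
            if (pos != std::string::npos)
                ++local[line.substr(pos + 1)];
        }
        return local;
    }

    // Reduce step: the main thread merges the per-thread maps into one.
    Counts reduce(const std::vector<Counts>& partials) {
        Counts total;
        for (const auto& partial : partials)
            for (const auto& kv : partial)
                total[kv.first] += kv.second;
        return total;
    }

    int main() {
        std::vector<std::vector<std::string>> chunks;  // filled by the file reader (omitted)
        std::vector<Counts> partials(chunks.size());
        std::vector<std::thread> workers;
        for (size_t i = 0; i < chunks.size(); ++i)
            workers.emplace_back(
                [&, i] { partials[i] = count_chunk(chunks[i]); });
        for (auto& t : workers) t.join();
        Counts total = reduce(partials);  // no thread touches another's map
    }

Since every thread writes only to its own private map, the map step needs no synchronization beyond the final join.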

You need to do the sorting part at the reduce step, when the data from all the threads is available. But you might want to do some work during the map step, so that it is easier (quicker) to finish the complete sort at the reduce step.
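
For instance, the final sort at the reduce step could be as simple as copying the merged map into a vector and sorting it by count (a sketch, where total is the merged map from the previous snippet):

    #include <algorithm>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Turn the merged map into a vector and sort it by count, highest first.
    std::vector<std::pair<std::string, long>>
    sort_by_frequency(const std::unordered_map<std::string, long>& total) {
        std::vector<std::pair<std::string, long>> result(total.begin(),
                                                         total.end());
        std::sort(result.begin(), result.end(),
                  [](const std::pair<std::string, long>& a,
                     const std::pair<std::string, long>& b) {
                      return a.second > b.second;  // descending frequency
                  });
        return result;
    }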

If you prefer to avoid a sequential sort at the end, you could use a custom data structure. I would use a map (something like a red-black tree or a hash table) for quickly finding a name, and a heap to keep the names ordered by frequency. Of course, you would need parallel versions of those data structures. Depending on how coarse-grained the parallelization is, you may or may not run into lock contention.
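
To illustrate the heap part, here is a single-threaded sketch (the parallel versions mentioned above would need concurrent counterparts of these containers):

    #include <iostream>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Print names in descending order of frequency using a max-heap.
    // std::pair compares by its first element, so (count, name) pairs
    // come off the heap highest count first.
    void print_descending(const std::unordered_map<std::string, long>& counts) {
        std::priority_queue<std::pair<long, std::string>> heap;
        for (const auto& kv : counts)
            heap.push({kv.second, kv.first});
        while (!heap.empty()) {
            std::cout << heap.top().second << ' ' << heap.top().first << '\n';
            heap.pop();
        }
    }

Building the heap is O(n) and extracting every element is O(n log n), so asymptotically this matches a plain sort; the heap pays off mainly if you maintain it incrementally during counting or only need the most frequent names.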

betabandido answered Sep 27 '22