Problem: I have a huge raw text file (assume 3 GB). I need to go through each word in the file and count how many times each word appears.
My proposed solution: Split the huge file into multiple files, with each split file holding words in sorted order. For example, all the words starting with "a" would be stored in an "_a.dic" file. That way, at no time would we exceed 26 files.
The problem with this approach is:
I can use streams to read the file, but I wanted to use threads to read certain parts of it. For example, read bytes 0-1024 with a separate thread (at least 4-8 threads, based on the number of processors in the box). Is this possible, or am I dreaming?
Any better approach?
Note: It should be a pure C++ or C based solution. No databases etc. are allowed.
You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.
In C++, use a map keyed by the string with a count (std::map<std::string, size_t>, IIRC). Read the file (once - it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and incrementing the count in the map entry for each word you find.
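A minimal sketch of that single-pass approach, treating whitespace-separated tokens as words (a real version would also strip punctuation and fold case):

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main(int argc, char* argv[]) {
        if (argc != 2) {
            std::cerr << "usage: " << argv[0] << " <file>\n";
            return 1;
        }
        std::ifstream in(argv[1]);
        if (!in) {
            std::cerr << "cannot open " << argv[1] << '\n';
            return 1;
        }
        std::map<std::string, std::size_t> counts;
        std::string word;
        while (in >> word)       // one linear pass over the file
            ++counts[word];      // operator[] default-constructs the count to 0
        for (const auto& entry : counts)
            std::cout << entry.first << ' ' << entry.second << '\n';
        return 0;
    }

As a bonus, printing the map at the end gives the words in sorted order for free, since std::map keeps its keys ordered.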
In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)
Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).
I don't think using multiple threads that read parts of the file in parallel will help much. I would expect this application to be bound by the bandwidth and latency of your hard disk, not by the actual word counting. Such a multi-threaded version might actually perform worse, because "quasi-random" file access is typically slower than linear file access.
If the CPU is really busy in a single-threaded version, there might be a potential speed-up. One thread could read the data in big chunks and put them into a queue of limited capacity, while a bunch of other worker threads each operate on their own chunk and count the words. After the counting worker threads finish, you have to merge the word counters.
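A rough sketch of that producer/consumer scheme using C++11 threads - the chunk size, queue depth, and line-based chunking (which conveniently avoids splitting a word across two chunks) are all arbitrary choices here:

    #include <algorithm>
    #include <condition_variable>
    #include <cstddef>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <mutex>
    #include <queue>
    #include <sstream>
    #include <string>
    #include <thread>
    #include <vector>

    std::queue<std::string> chunks;            // bounded queue of text chunks
    std::mutex m;
    std::condition_variable not_empty, not_full;
    bool done = false;
    const std::size_t max_queued = 8;          // queue capacity (arbitrary)

    void producer(const char* path) {
        std::ifstream in(path);
        std::string line, chunk;
        while (std::getline(in, line)) {       // line granularity: no word straddles chunks
            chunk += line;
            chunk += '\n';
            if (chunk.size() >= (1 << 20)) {   // ~1 MB chunks (arbitrary)
                std::unique_lock<std::mutex> lock(m);
                not_full.wait(lock, []{ return chunks.size() < max_queued; });
                chunks.push(std::move(chunk));
                chunk.clear();
                not_empty.notify_one();
            }
        }
        std::unique_lock<std::mutex> lock(m);
        if (!chunk.empty()) {                  // flush the final partial chunk
            not_full.wait(lock, []{ return chunks.size() < max_queued; });
            chunks.push(std::move(chunk));
        }
        done = true;
        not_empty.notify_all();
    }

    void worker(std::map<std::string, std::size_t>& local) {
        for (;;) {
            std::string chunk;
            {
                std::unique_lock<std::mutex> lock(m);
                not_empty.wait(lock, []{ return !chunks.empty() || done; });
                if (chunks.empty()) return;    // producer finished and queue drained
                chunk = std::move(chunks.front());
                chunks.pop();
                not_full.notify_one();
            }
            std::istringstream ss(chunk);
            std::string word;
            while (ss >> word)                 // count into the thread-local map
                ++local[word];
        }
    }

    int main(int argc, char* argv[]) {
        if (argc != 2) return 1;
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::map<std::string, std::size_t>> locals(n);
        std::vector<std::thread> pool;
        std::thread reader(producer, argv[1]);
        for (unsigned i = 0; i < n; ++i)
            pool.emplace_back(worker, std::ref(locals[i]));
        reader.join();
        for (auto& t : pool) t.join();
        std::map<std::string, std::size_t> total;  // merge the per-thread counts
        for (const auto& local : locals)
            for (const auto& kv : local)
                total[kv.first] += kv.second;
        std::cout << "distinct words: " << total.size() << '\n';
        return 0;
    }

Each worker counts into its own map, so the only synchronization is around the chunk queue; the per-thread maps are merged once at the end. Whether this beats the single-threaded version still depends on the disk being fast enough to keep the workers fed.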