I have a map from string to unsigned, in which I store each word and its frequency, of the following form:
map<string,unsigned> mapWordFrequency; //contains 1 billion such mappings
Then I read a huge file (100 GB) and retain only those words in the file whose frequency is greater than 1000. I check the frequency of each word in the file using: mapWordFrequency[word] > 1000. However, since mapWordFrequency has 1 billion mappings and the file is huge, checking mapWordFrequency[word] > 1000 for every word in the file is very slow and takes more than 2 days. Can someone please suggest how I can improve the efficiency of this code?
The map does not fit in my RAM, and swapping is consuming a lot of time.
Would erasing all words with frequency < 1000, using map's erase function, help?
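To clarify, by erasing I mean something along these lines, dropping every entry below the threshold before processing the file:

for (auto it = mapWordFrequency.begin(); it != mapWordFrequency.end(); )
{
    if (it->second < 1000)
        it = mapWordFrequency.erase(it); // erase() returns the next valid iterator
    else
        ++it;
}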
I suggest you use an unordered_map as opposed to a map. As already discussed in the comments, the former will give you an average insertion/retrieval time of O(1), as opposed to O(log n) for a map.
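A minimal sketch of the switch, assuming the counts are built the same way as before (the reserve figure is only a guess; size it to your number of distinct words):

#include <string>
#include <unordered_map>

std::unordered_map<std::string, unsigned> mapWordFrequency;

void buildMap()
{
    // Reserving buckets up front avoids repeated rehashing as the map grows.
    mapWordFrequency.reserve(1000000000ULL);
    // ... then count exactly as before: ++mapWordFrequency[word];
}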
As you have already said, memory swapping is consuming a lot of time, so how about tackling the problem incrementally: load as much data as you can into an unordered_map in memory, hash it, and continue. After one pass you will have a lot of unordered_maps, and you can start combining them in subsequent passes (see the sketch below).
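A sketch of that merge step, assuming each partial count has been reloaded as its own unordered_map (the function name is mine):

#include <string>
#include <unordered_map>

// Fold one partial count map into an accumulator. This assumes the two
// chunks fit in memory together; otherwise merge pairwise over more passes.
void mergeCounts(std::unordered_map<std::string, unsigned>& total,
                 const std::unordered_map<std::string, unsigned>& chunk)
{
    for (const auto& entry : chunk)
        total[entry.first] += entry.second;
}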
You could improve the speed further by doing this in a distributed manner, processing pieces of the data on different computers and then combining the results (which will again be in the form of unordered maps). However, I have no prior experience with distributed computing, so I can't help beyond this.
Also, if implementing something like this is too cumbersome, I suggest you use external merge sort. It is a method of sorting a file too large to fit into memory by sorting smaller chunks and then combining them. The reason I'm suggesting this is that external merge sort is a pretty common technique, and you might find already implemented solutions for your need. Even though the time complexity of sorting is higher than your idea of using a map, it will reduce the swapping overhead compared to a map. As pointed out in the comments, sort on Linux implements external merge sort. Once the words are sorted, counting them is a single sequential pass (see the sketch below).
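A sketch of that counting pass, assuming the sorted words arrive on stdin, one per line:

#include <iostream>
#include <string>

// Reads a sorted word stream and prints every word seen more than 1000
// times. Only the current word and one counter are held in memory.
int main()
{
    std::string prev, word;
    unsigned long long count = 0;
    while (std::getline(std::cin, word)) {
        if (word == prev) {
            ++count;
        } else {
            if (count > 1000)
                std::cout << prev << ' ' << count << '\n';
            prev = word;
            count = 1;
        }
    }
    if (count > 1000)
        std::cout << prev << ' ' << count << '\n';
}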
You can use a hash map where the hashed string is the key and the occurrence count is the value. It will be faster. You can choose a good string hash function based on your requirements. Here is a link to some good hashing functions:
http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
You can also use a third-party library for this.
EDIT: pseudo code

// If MAX_SIZE is large, go with dynamically allocated memory instead
// (e.g. std::vector<unsigned>(MAX_SIZE, 0)).
unsigned mapWordFrequency[MAX_SIZE] = {0};

size_t someHashMethod(const std::string& input); // any good string hash

for (const std::string& currString : listOfStrings) {
    size_t key = someHashMethod(currString) % MAX_SIZE;
    if (++mapWordFrequency[key] > 1000)
        doSomeThing();
}
Update: As @Jens pointed out, there can be cases where someHashMethod() returns the same hash for two different strings. In that case we have to resolve the collision, and the lookup time becomes worse than constant. Also, as the input size is very large, creating a single array of that size may not be possible. In that case we may use distributed computing concepts, but the actual lookup time will again go up compared to a single machine.
Depending on the statistical distribution of your words, it may be worth compressing each word before adding it to the map. As long as the compression is lossless, you can recover the original words after filtering. The idea is that you may be able to reduce the average word size (hence saving memory and time spent comparing keys). Here is a simple compression/decompression pair you could use:
#include <string>
#include <sstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/copy.hpp>
// Compress a single string with zlib via Boost.Iostreams.
inline std::string compress(const std::string& data)
{
std::stringstream decompressed {data};
boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
stream.push(boost::iostreams::zlib_compressor());
stream.push(decompressed);
std::stringstream compressed {};
boost::iostreams::copy(stream, compressed);
return compressed.str();
}
// Recover the original string from its zlib-compressed form.
inline std::string decompress(const std::string& data)
{
std::stringstream compressed {data};
boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
stream.push(boost::iostreams::zlib_decompressor());
stream.push(compressed);
std::stringstream decompressed;
boost::iostreams::copy(stream, decompressed);
return decompressed.str();
}
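A sketch of how the map might then be used, keying on the compressed form and decompressing only the words that pass the threshold (the names here are illustrative):

#include <iostream>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, unsigned> compressedCounts;

void countWord(const std::string& word)
{
    ++compressedCounts[compress(word)];
}

void printFrequentWords()
{
    for (const auto& entry : compressedCounts)
        if (entry.second > 1000)
            std::cout << decompress(entry.first) << '\n';
}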
In addition to using a std::unordered_map as others have suggested, you could also move any words that have already been seen more than 1000 times out of the map and into a std::unordered_set. This would require also checking the set before the map, but you may see better hash performance by doing this. It may also be worth rehashing occasionally if you employ this strategy. A sketch of the two-container approach follows.
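A minimal sketch, assuming all you ultimately need is the set of words seen more than 1000 times (the function name is mine):

#include <string>
#include <unordered_map>
#include <unordered_set>

std::unordered_map<std::string, unsigned> counts;  // words still below the threshold
std::unordered_set<std::string> frequentWords;     // words already seen > 1000 times

void addWord(const std::string& word)
{
    // Check the (hopefully much smaller) set first.
    if (frequentWords.count(word))
        return;
    if (++counts[word] > 1000) {
        // Promote the word: it no longer needs a counter.
        frequentWords.insert(word);
        counts.erase(word);
    }
}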