Need for a fast map between strings and integers

Tags: c++, dictionary

I have a map from string to unsigned in which I store each word and its frequency, in the following form:

map<string,unsigned> mapWordFrequency; //contains 1 billion such mappings

Then I read a huge file (100 GB) and retain only those words in the file whose frequency is greater than 1000. I check the frequency of each word in the file using mapWordFrequency[word] > 1000. However, since mapWordFrequency has 1 billion mappings and my file is huge, checking mapWordFrequency[word] > 1000 for every word in the file is very slow and takes more than 2 days. Can someone please suggest how I can improve the efficiency of this code?

The map does not fit in my RAM, and swapping consumes a lot of time.

Would erasing all words with a frequency < 1000, using map's erase function, help?

Asked by Rose Sharma, Sep 01 '15

3 Answers

I suggest you use an unordered_map as opposed to a map. As already discussed in the comments, the former gives you an average insertion/retrieval time of O(1), as opposed to O(log n) for a map.

As you have already said, memory swapping is consuming a lot of time, so how about tackling the problem incrementally? Load as much data as you can into an in-memory unordered_map, count it, write the partial result out, and continue. After one pass you should have a number of partial unordered_maps, and you can start to combine them in subsequent passes (see the sketch below).
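A minimal sketch of this chunked approach (the file names, chunk limit, and helper name are illustrative assumptions, not part of the original answer):

#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>

// Sketch only: count words chunk by chunk so that a single bounded
// unordered_map is in memory at any time. Partial counts are spilled to
// per-chunk files ("counts_0.txt", "counts_1.txt", ...) and merged later.
void countInChunks(const std::string& inputPath, std::size_t maxWordsPerChunk)
{
    std::ifstream in{inputPath};
    std::unordered_map<std::string, unsigned> counts;
    std::string word;
    int chunk = 0;

    auto flush = [&] {
        std::ofstream out{"counts_" + std::to_string(chunk++) + ".txt"};
        for (const auto& kv : counts)
            out << kv.first << ' ' << kv.second << '\n';
        counts.clear();
    };

    while (in >> word) {
        ++counts[word];
        if (counts.size() >= maxWordsPerChunk)
            flush();                       // spill the partial map to disk
    }
    if (!counts.empty())
        flush();
}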

You could improve the speed by doing this in a distributed manner: process the pieces of data on different computers and then combine the results (which will be in the form of unordered_maps). However, I have no prior experience in distributed computing, so I can't help beyond this.

Also, if implementing something like this is too cumbersome, I suggest you use an external merge sort. It is a method of sorting a file too large to fit into memory by sorting smaller chunks and merging them. The reason I'm suggesting this is that external merge sort is a pretty common technique, and you may find ready-made implementations for your needs. Even though the time complexity of sorting is higher than your idea of using a map, it will reduce the swapping overhead compared to a map. As pointed out in the comments, sort on Linux implements an external merge sort.
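If you go this route, the counting step afterwards only needs to track runs of identical consecutive words. A rough sketch, assuming the file has already been sorted to one word per line (the file name is a placeholder):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // "words_sorted.txt" is assumed to contain one word per line, already
    // sorted (e.g. by an external merge sort such as GNU sort).
    std::ifstream in{"words_sorted.txt"};
    std::string word, prev;
    unsigned long long count = 0;

    while (in >> word) {
        if (word == prev) {
            ++count;
        } else {
            if (count > 1000)
                std::cout << prev << ' ' << count << '\n';
            prev = word;
            count = 1;
        }
    }
    if (count > 1000)
        std::cout << prev << ' ' << count << '\n';   // report the final run
}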

Answered by therainmaker


You can use a hash map where the hashed string is the key and the occurrence count is the value. It will be faster. You can choose a good string hash based on your requirements. Here is a link to some good hashing functions:

http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

You can also use a third-party library for this.
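For illustration, one common and simple choice is FNV-1a (a rough sketch; any good string hash will do, and the result should be reduced modulo your table size to get an index):

#include <cstdint>
#include <string>

// FNV-1a: a simple, widely used non-cryptographic string hash.
inline std::uint64_t fnv1a(const std::string& s)
{
    std::uint64_t hash = 14695981039346656037ull;   // FNV offset basis
    for (unsigned char c : s) {
        hash ^= c;
        hash *= 1099511628211ull;                   // FNV prime
    }
    return hash;
}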

EDIT: pseudo code

int mapWordFrequency[MAX_SIZE] = {0}; // if MAX_SIZE is large, go with dynamic allocation
int someHashMethod(const std::string& input); // maps a word to an index in [0, MAX_SIZE)

for (const std::string& currString : listOfStrings) {
    int key = someHashMethod(currString);
    ++mapWordFrequency[key];
    if (mapWordFrequency[key] > 1000)
        doSomeThing();
}

Update: As @Jens pointed out, there can be cases where someHashMethod() returns the same int (hash) for two different strings. In that case we have to resolve the collision, and the lookup time will then be more than constant. Also, since the input size is very large, creating a single array of that size may not be possible. In that case we may use distributed computing concepts, but the actual lookup time will again go up compared to a single machine.

Answered by Nishant Kumar


Depending on the statistical distribution of your words, it may be worth compressing each word before adding it to the map. As long as the compression is lossless, you can recover the original words after filtering. The idea is that you may be able to reduce the average word size (hence saving memory and time spent comparing keys). Here is a simple compression/decompression procedure you could use:

#include <string>
#include <sstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/copy.hpp> 

inline std::string compress(const std::string& data)
{
    std::stringstream decompressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_compressor());
    stream.push(decompressed);
    std::stringstream compressed {};
    boost::iostreams::copy(stream, compressed);
    return compressed.str();
}

inline std::string decompress(const std::string& data)
{
    std::stringstream compressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_decompressor());
    stream.push(compressed);
    std::stringstream decompressed;
    boost::iostreams::copy(stream, decompressed);
    return decompressed.str();
}
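One possible way to plug these helpers into the counting loop (illustrative only; it assumes the compress/decompress functions above and reads words from standard input):

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, unsigned> mapWordFrequency;
    std::string word;

    // Store the compressed form as the key; decompress only when reporting.
    while (std::cin >> word)
        ++mapWordFrequency[compress(word)];

    for (const auto& kv : mapWordFrequency)
        if (kv.second > 1000)
            std::cout << decompress(kv.first) << ' ' << kv.second << '\n';
}

Keep in mind that for very short words the zlib header overhead may outweigh the savings, so it is worth measuring before committing to this.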

In addition to using a std::unordered_map as others have suggested, you could also move any words that have already been seen more than 1000 times out of the map and into a std::unordered_set. This requires checking the set before the map, but you may see better hash performance by doing so. It may also be worth rehashing the map occasionally if you employ this strategy. A sketch of this idea follows below.
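A hedged sketch of that set-before-map lookup (the function and variable names are mine, not from the answer):

#include <string>
#include <unordered_map>
#include <unordered_set>

// Words already seen more than 1000 times live in 'frequentWords', so the
// larger counting map is only consulted for words not yet promoted.
void countWord(const std::string& word,
               std::unordered_map<std::string, unsigned>& counts,
               std::unordered_set<std::string>& frequentWords)
{
    if (frequentWords.count(word))        // already known to exceed 1000
        return;
    if (++counts[word] > 1000) {
        frequentWords.insert(word);
        counts.erase(word);               // shrink the map once promoted
    }
}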

Answered by Daniel