Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

optimizing boost unordered map and sets, C++

Tags:

c++

c++11

I will be parsing 60GB of text and doing a lot of insert and lookups in maps. I just started using boost::unordered_set and boost::unordered_map As my program starts filling in these containers they start growing bigger and bigger and i was wondering if this would be a good idea to pre allocate memory for these containers. something like mymap::get_allocator().allocate(N); ?

or should i just leave them to allocate and figure out grow factors by themselves? the codes look like this

boost::unordered_map <string,long> words_vs_frequency, wordpair_vs_frequency;   
boost::unordered_map <string,float> word_vs_probability, wordpair_vs_probability,
           wordpair_vs_MI;                  
//... ... ...                                   

N = words_vs_frequency.size();
long   y =0; float MIWij =0.0f, maxMI=-999999.0f;
for (boost::unordered_map <string,long>::iterator i=wordpair_vs_frequency.begin(); 
                     i!=wordpair_vs_frequency.end(); ++i){
if (i->second >= BIGRAM_OCCURANCE_THRESHOLD)
    {
    y++;
    Wij = i->first;
    WordPairToWords(Wij, Wi,Wj);
    MIWij =  log ( wordpair_vs_probability[Wij] /
             (word_vs_probability[Wi] * word_vs_probability[Wj]) 
            );

    // keeping only the pairs which MI value greater than 
    if (MIWij > MUTUAL_INFORMATION_THRESHOLD)
        wordpair_vs_MI[ Wij ] = MIWij;
    if(MIWij > maxMI )
        maxMI = MIWij; 
    }

   }

Thanks in advance

like image 233
user109134 Avatar asked Dec 07 '22 07:12

user109134


2 Answers

According to the documentation, both unordered_set and unordered_map have a method

void rehash(size_type n);

that regenerates the hashtable so that it contains at least n buckets. (It sounds like it does what reserve() does for STL containers).

like image 127
j_random_hacker Avatar answered Dec 28 '22 08:12

j_random_hacker


I would try it both ways, which will let you generate hard data showing whether one method works better than the other. We can speculate all day about which method will be optimal, but as with most performance questions, the best thing to do is try it out and see what happens (and then fix the parts that actually need fixing).

That being said, the Boost authors seem to be very smart, so it quite possibly will work fine as-is. You'll just have to test and see.

like image 39
Charlie Avatar answered Dec 28 '22 07:12

Charlie