When does a map become better than two vectors?

A map looks up elements by binary search, which has logarithmic complexity; this means that for a small enough collection of objects, a map will perform worse than two vectors searched linearly.

How large should the object (key) pool be for a map to start performing better than two vectors?

Edit: A more generalized version of the question: how large should the object pool be for binary search to perform better than linear search?

I'm using strings as keys and the values are pointers, but my particular use case probably shouldn't matter. I'm more curious to understand how to use the two tools properly.
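
For concreteness, here's roughly the shape of the two approaches I'm comparing; the Widget type and the function names are just placeholders, not my actual code:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Widget;  // placeholder for whatever the pointer values point at

// Two parallel vectors: lookup is a linear scan over the keys.
Widget* find_linear(const std::vector<std::string>& keys,
                    const std::vector<Widget*>& values,
                    const std::string& key) {
    for (std::size_t i = 0; i < keys.size(); ++i)
        if (keys[i] == key)
            return values[i];
    return 0;
}

// A map: lookup walks a balanced tree, i.e. a binary search.
Widget* find_map(const std::map<std::string, Widget*>& m,
                 const std::string& key) {
    std::map<std::string, Widget*>::const_iterator pos = m.find(key);
    return pos != m.end() ? pos->second : 0;
}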

asked Oct 24 '11 by Paul Manta

2 Answers

If you'll forgive my saying so, most of the answers sound to me like various ways of saying: "I don't know", without actually admitting that they don't know. While I generally agree with the advice they've given, none of them seems to have attempted to directly address the question you asked: what is the break-even point.

To be fair, when I read the question, I didn't really know either. It's one of those things where we all know the basics: for a small enough collection, a linear search will probably be faster, and for a large enough collection, a binary search will undoubtedly be faster. I, however, have never really had much reason to investigate anything about where the break-even point would really be. Your question got me curious, though, so I decided to write a bit of code to get at least some idea.

This code is definitely a very quick hack (lots of duplication, it currently supports only one type of key, etc.), but at least it should give some idea of what to expect:

#include <set>
#include <vector>
#include <string>
#include <ctime>
#include <cstdlib>
#include <cstddef>
#include <iostream>
#include <algorithm>

int main() {

    static const int reps = 100000;

    std::vector<int> data_vector;
    std::set<int> data_set;
    std::vector<int> search_keys;

    for (int size = 10; size < 100; size += 10) {
        data_vector.clear();
        data_set.clear();
        search_keys.clear();

        // Build both containers from the same keys, keeping roughly every
        // tenth key so we also search for values that are present.
        int num_keys = size / 10;
        for (int i = 0; i < size; i++) {
            int key = rand();

            if (i % num_keys == 0)
                search_keys.push_back(key);

            data_set.insert(key);
            data_vector.push_back(key);
        }

        // Search for a few keys that probably aren't present.
        for (int i = 0; i < 10; i++)
            search_keys.push_back(rand());

        // The totals exist only so the compiler can't optimize the searches
        // away; their values are meaningless (and may overflow).
        long total_linear = 0, total_binary = 0;

        clock_t start_linear = clock();
        for (int k = 0; k < reps; k++) {
            for (std::size_t j = 0; j < search_keys.size(); j++) {
                std::vector<int>::iterator pos =
                    std::find(data_vector.begin(), data_vector.end(), search_keys[j]);
                if (pos != data_vector.end())
                    total_linear += *pos;
            }
        }
        clock_t stop_linear = clock();

        clock_t start_binary = clock();
        for (int k = 0; k < reps; k++) {
            for (std::size_t j = 0; j < search_keys.size(); j++) {
                std::set<int>::iterator pos = data_set.find(search_keys[j]);
                if (pos != data_set.end())
                    total_binary += *pos;
            }
        }
        clock_t stop_binary = clock();

        std::cout << "\nignore: " << total_linear << " ignore also: " << total_binary << "\n";

        std::cout << "For size = " << size << "\n";
        std::cout << "\tLinear search time = " << stop_linear - start_linear << "\n";
        std::cout << "\tBinary search time = " << stop_binary - start_binary << "\n";
    }
    return 0;
}

Here are the results I get running this on my machine:

ignore: 669830816 ignore also: 669830816
For size = 10
        Linear search time = 37
        Binary search time = 45

ignore: 966398112 ignore also: 966398112
For size = 20
        Linear search time = 60
        Binary search time = 47

ignore: 389263520 ignore also: 389263520
For size = 30
        Linear search time = 83
        Binary search time = 63

ignore: -1561901888 ignore also: -1561901888
For size = 40
        Linear search time = 106
        Binary search time = 65

ignore: -1741869184 ignore also: -1741869184
For size = 50
        Linear search time = 127
        Binary search time = 69

ignore: 1130798112 ignore also: 1130798112
For size = 60
        Linear search time = 155
        Binary search time = 75

ignore: -1949669184 ignore also: -1949669184
For size = 70
        Linear search time = 173
        Binary search time = 83

ignore: -1991069184 ignore also: -1991069184
For size = 80
        Linear search time = 195
        Binary search time = 90

ignore: 1750998112 ignore also: 1750998112
For size = 90
        Linear search time = 217
        Binary search time = 79

Obviously that's not the only possible test (or even close to the best one possible), but it seems to me that even a little hard data is better than none at all.

Edit: I would note for the record that I see no reason that code using two vectors (or a vector of pairs) can't be just as clean as code using a set or map. Obviously you'd want to put the code for it into a small class of its own, but I see no reason at all that it couldn't present precisely the same interface to the outside world that map does. In fact, I'd probably just call it a `tiny_map` (or something on that order).
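
For example, a minimal sketch of the idea might look something like this; the name and the cut-down interface are purely illustrative:

#include <utility>
#include <vector>

// A minimal sketch of a map-like wrapper over a flat vector of pairs.
// The name tiny_map and this cut-down interface are illustrative only;
// a real replacement would mirror more of std::map's interface.
template <typename Key, typename Value>
class tiny_map {
public:
    typedef std::pair<Key, Value> value_type;
    typedef typename std::vector<value_type>::iterator iterator;

    iterator begin() { return data_.begin(); }
    iterator end()   { return data_.end(); }

    // For the small sizes this class targets, a plain linear scan is
    // the whole point -- no ordering is maintained at all.
    iterator find(const Key& k) {
        for (iterator pos = data_.begin(); pos != data_.end(); ++pos)
            if (pos->first == k)
                return pos;
        return data_.end();
    }

    Value& operator[](const Key& k) {
        iterator pos = find(k);
        if (pos != data_.end())
            return pos->second;
        data_.push_back(value_type(k, Value()));
        return data_.back().second;
    }

private:
    std::vector<value_type> data_;
};

From the caller's point of view, nothing betrays the flat vector underneath.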

One of the basic points of OO programming (and it remains the case in generic programming, to at least some degree) is to separate the interface from the implementation. In this case, you're talking about purely an implementation detail that need not affect the interface at all. In fact, if I were writing a standard library, I'd be tempted to incorporate this as a "small map optimization" analogous to the common small string optimization. Basically, just allocate an array of 10 (or so) objects of value_type directly in the map object itself, and use them when/if the map is small, then move the data to a tree if it grows large enough to justify it. The only real question is whether people use tiny maps often enough to justify the work.
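
A rough sketch of how such a hybrid might be structured, assuming a threshold of 10 entries and ignoring iterators, erasure, and the rest of a real map's interface:

#include <cstddef>
#include <map>
#include <utility>

// A rough sketch of the "small map optimization" idea: keep up to N
// entries in an inline array and fall back to a real tree beyond that.
// The threshold, names, and cut-down interface are all illustrative;
// this version also requires default-constructible Key and Value.
template <typename Key, typename Value, std::size_t N = 10>
class small_map {
    std::pair<Key, Value> small_[N];  // inline storage, no heap allocation
    std::size_t count_;               // entries currently in small_
    bool spilled_;                    // true once data has moved to the tree
    std::map<Key, Value> big_;

public:
    small_map() : count_(0), spilled_(false) {}

    // Linear scan while small; real tree lookup once spilled.
    Value* find(const Key& k) {
        if (spilled_) {
            typename std::map<Key, Value>::iterator pos = big_.find(k);
            return pos != big_.end() ? &pos->second : 0;
        }
        for (std::size_t i = 0; i < count_; ++i)
            if (small_[i].first == k)
                return &small_[i].second;
        return 0;
    }

    void insert(const Key& k, const Value& v) {
        if (Value* existing = find(k)) {  // map semantics: replace existing
            *existing = v;
            return;
        }
        if (!spilled_ && count_ < N) {
            small_[count_++] = std::make_pair(k, v);
            return;
        }
        if (!spilled_) {  // grew past N: migrate everything into the tree
            big_.insert(small_, small_ + count_);
            spilled_ = true;
        }
        big_[k] = v;
    }
};

The migration cost is paid once, when the map outgrows its inline array.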

answered by Jerry Coffin

The code for a map will be much cleaner than the code for two vectors; that should be your primary concern. Only once you've determined that the performance of the map is a problem in your own application should you worry about the difference, and at that point you should benchmark it yourself.

answered by Mark Ransom