Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding which bin a values fall into

I'm trying to find which category C a double x belongs to. My categories are defined as strings names and doubles values in a file like this

A 1.0
B 2.5
C 7.0 

which should be interpreted like this

"A": 0 < x <= 1.0
"B": a < x <= 2.5
"C": b < x <= 7.0

(the input can have arbitrary length and may have to be sorted by their values). I simply need a function like this

std::string findCategory(categories_t categories, double x) {
    ...insert magic here
}

so for this example I'd expect

findCategory(categories, 0.5) == "A"
findCategory(categories, 1.9) == "B"
findCategory(categories, 6.0) == "C"

So my question is a) how to write the function and b) what the best choice of category_t may be (using stl in pre 11 C++). I made several attempts, all of which were... less than successful.

like image 573
nxpnsv Avatar asked Jan 19 '12 20:01

nxpnsv


2 Answers

One option would be to use the std::map container with doubles as keys and values corresponding to what value is assigned to the range whose upper endpoint is the given value. For example, given your file, you would have a map like this:

std::map<double, std::string> lookup;
lookup[1.0] = "A";
lookup[2.5] = "B";
lookup[7.0] = "C";

Then, you could use the std::map::lower_bound function, given some point, to get back the key/value pair whose key (upper endpoint) is the first key in the map that is at least as large as the point in question. For example, with the above map, lookup.lower_bound(1.37) would return an iterator whose value is "B." lookup.lower_bound(2.56) would return an iterator whose value is "C." These lookups are fast; they take O(log n) time for a map with n elements.

In the above, I'm assuming that the values you're looking up are all nonnegative. If negative values are allowed, you can add a quick test in to check whether the value is negative before you do any lookups. That way, you can eliminate spurious results.

For what it's worth, if you happen to know something about the distribution of your lookups (say, they're uniformly distributed) it's possible to build a special data structure called an optimal binary search tree that will give better access times than the std::map. Also, depending on your application, there may be even faster options available. For example, if you're doing this because you want to randomly choose one of the outcomes with differing probabilities, then I would suggest looking into this article on the alias method, which lets you generate random values in O(1) time.

Hope this helps!

like image 144
templatetypedef Avatar answered Oct 07 '22 13:10

templatetypedef


You can use the pair type and the 'lower_bound' from < algorithm > http://www.cplusplus.com/reference/algorithm/lower_bound/.

Let's define your categories in terms of the upper edge: typedef pair categories_t;

Then just make a vector of those edges and search it using binary search. See the full example below.

#include <string>
#include <vector>
#include <algorithm>
#include <iostream>

using namespace std;
typedef pair<double,string> category_t;

std::string findCategory(const vector<category_t> &categories, double x) {
   vector<category_t>::const_iterator it=std::lower_bound(categories.begin(), categories.end(),category_t(x,""));
   if(it==categories.end()){
      return "";
   }
   return it->second;
}

int main (){

   vector< category_t > edges;
   edges.push_back(category_t(0,"bin n with upper edge at 0 (underflow)"));
   edges.push_back(category_t(1,"bin A with upper edge at 1"));
   edges.push_back(category_t(2.5,"bin B with upper edge at 2.5"));
   edges.push_back(category_t(7,"bin C with upper edge at 7"));
   edges.push_back(category_t(8,"bin D with upper edge at 8"));
   edges.push_back(category_t(9,"bin E with upper edge at 9"));
   edges.push_back(category_t(10,"bin F with upper edge at 10"));

   vector< double > examples ;
   examples.push_back(1);
   examples.push_back(3.3);
   examples.push_back(7.4);
   examples.push_back(-5);
   examples.push_back(15);

   for( vector< double >::const_iterator eit =examples.begin();eit!=examples.end();++eit)
      cout << "value "<< *eit << " : " << findCategory(edges,*eit) << endl;   
}

The comparisons works the way we want it to since the double is the first in the pair, and pairs are compared first by comparing the first and then the second constituent. Else we would define a compare predicate as described at the page I linked above.

like image 44
Johan Lundberg Avatar answered Oct 07 '22 13:10

Johan Lundberg