Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recommend C++ container to hold top 20 minimum values

In SQL there is the feature to say something like

SELECT TOP 20 distance FROM dbFile ORDER BY distance ASC

If my SQL is correct with, say 10,000 records, this should return the 20 smallest distances in my databse.

I don't have a database. I have a 100,000-element simple array.

Is there a C++ container, Boost, MFC or STL that provides simple code for a struct like

struct closest{
    int    ID;
    double distance;
    closest():ID(-1), distance(std::numeric_limits<double>::max( )){}
};

Where I can build a sorted by distance container like

boost::container::XXXX<closest> top(20);

And then have a simple

top.replace_if(closest(ID,Distance));

Where the container will replace the entry with the current highest distance in my container with my new entry if it is less than the current highest distance in my container.

I am not worried about speed. I like elegant clean solutions where containers and code do all the heavy lifting.

EDIT. Addendum after all the great answers received.

What I really would of liked to have found, due to its elegance. Is a sorted container that I could create with a container size limit. In my case 20. Then I could push or insert to my hearts content a 100 000 items or more. But. There is always a but. The container would of maintained the max size of 20 by replacing or not inserting an item if its comparator value was not within the lowest 20 values.

Yes. I know now from all these answers that via programming and tweaking existing containers the same effect can be achieved. Perhaps when the next round of suggestions for the C & C++ standards committee sits. We could suggest. Self sorting (which we kind of have already) and self size limiting containers.

like image 882
kingchris Avatar asked Dec 20 '15 07:12

kingchris


People also ask

What is container c++?

A container is an object that stores a collection of objects of a specific type. For example, if we need to store a list of names, we can use a vector . C++ STL provides different types of containers based on our requirements.

Which container has the simplest internal data structure?

So what are the Thumb rules for using a specific container: By default, you should use a vector. It has the simplest internal data structure and provides random access.

What is the fastest container in C++?

Overall, for insertions, the vector and deque are the fastest for small types and the list is the fastest for the very large types.

When to use std deque?

The std::deque container is most efficient when appending items to the front or back of a queue. Unlike std::vector , std::deque does not provide a mechanism to reserve a buffer. The underlying buffer is also not guaranteed to be compatible with C-style array APIs.


8 Answers

What you need is to have a maxheap of size 20. Recall that the root of your heap will be the largest value in the heap.

This heap will contain the records with smallest distance that you have encountered so far. For the first 20 out of 10000 values you just push to the heap.

At this point you iterate through the rest of the records and for each record, you compare it to the root of your heap.

Remember that the root of your heap is basically the very worst of the very best.(The record with the largest distance, among the 20 records with the shortest distance you have encountered so far).

If the value you are considering is not worth keeping (its distance is larger that the root of your tree), ignore that record and just keep moving.

Otherwise you pop your heap (get rid of the root) and push the new value in. The priority queue will automatically put its record with the largest distance on the root again.

Once you keep doing this over the entire set of 10000 values, you will be left with the 20 records that have the smallest distance, which is what you want.

Each push-pop takes constant O(1) time, iterating through all inputs of N is O(n) so this is a Linear solution.

Edit: I thought it would be useful to show my idea in C++ code. This is a toy example, you can write a generic version with templates but I chose to keep it simple and minimalistic:

#include <iostream>
#include <queue>
using namespace std;
class smallestElements
{
private:
    priority_queue<int,std::vector<int>,std::less<int> > pq;
    int maxSize;
public:
    smallestElements(int size): maxSize(size)
    {
    pq=priority_queue<int, std::vector<int>, std::less<int> >();
    }
    void possiblyAdd(int newValue)
    {
        if(pq.size()<maxSize)
        {
            pq.push(newValue);
            return;
        }
        if(newValue < pq.top())
        {
            pq.pop(); //get rid of the root
            pq.push(newValue); //priority queue will automatically restructure
        }
    }
    void printAllValues()
    {
        priority_queue<int,std::vector<int>,std::less<int> > cp=pq;
        while(cp.size()!=0)
        {
            cout<<cp.top()<<" ";
            cp.pop();
        }
        cout<<endl;
    }
};

How you use this is really straight forward. basically in your main function somewhere you will have:

smallestElements se(20); //we want 20 smallest
//...get your stream of values from wherever you want, call the int x
se.possiblyAdd(x); //no need for bounds checking or anything fancy
//...keep looping or potentially adding until the end

se.printAllValues();//shows all the values in your container of smallest values
// alternatively you can write a function to return all values if you want
like image 93
ForeverStudent Avatar answered Nov 02 '22 21:11

ForeverStudent


If this is about filtering the 20 smallest elements from a stream on the fly, then a solution based on std::priority_queue (or std::multiset) is the way to go.

However, if it is about finding the 20 smallest elements in a given arraym I wouldn't go for a special container at all, but simply the algorithm std::nth_element - a partial sorting algorithm that will give you the n smallest elements - EDIT: or std::partial_sort (thanks Jarod42) if the elements also have to be sorted. It has linear complexity and it's just a single line to write (+ the comparison operator, which you need in any case):

#include <vector>
#include <iostream>
#include <algorithm>

struct Entry {
    int ID;
    double distance;    
};

std::vector<Entry> data;    

int main() {
    //fill data;

    std::nth_element(data.begin(), data.begin() + 19, data.end(), 
        [](auto& l, auto& r) {return l.distance < r.distance; });

    std::cout << "20 elements with smallest distance: \n"; 
    for (size_t i = 0; i < 20; ++i) {
        std::cout << data[i].ID << ":" << data[i].distance << "\n";
    }
    std::cout.flush();
}

If you don't want to change the order of your original array, you would have to make a copy of the whole array first though.

like image 38
MikeMB Avatar answered Nov 02 '22 21:11

MikeMB


My first idea would be using a std::map or std::set with a custom comparator for this (edit: or even better, a std::priority_queue as mentioned in the comments).

Your comparator does your sorting.

You essentially add all your elements to it. After an element has been added, check whether there are more than n elements inside. If there are, remove the last one.

like image 25
Mario Avatar answered Nov 02 '22 21:11

Mario


I am not 100% sure, that there is no more elegant solution, but even std::set is pretty pretty.

All you have to do is to define a proper comparator for your elements (e.g. a > operator) and then do the following:

std::set<closest> tops(arr, arr+20)
tops.insert(another);
tops.erase(tops.begin());
like image 23
Marandil Avatar answered Nov 02 '22 22:11

Marandil


I would use nth_element like @juanchopanza suggested before he deleted it.

His code looked like:

bool comp(const closest& lhs, const closest& rhs)
{
    return lhs.distance < rhs.distance;
}

then

std::vector<closest> v = ....;
nth_element(v.begin(), v.begin() + 20, v.end(), comp);

Though if it was only ever going to be twenty elements then I would use a std::array.

like image 26
graham.reeds Avatar answered Nov 02 '22 23:11

graham.reeds


Just so you can all see what I am currently doing which seems to work.

struct closest{
    int base_ID;
    int ID;
    double distance;

    closest(int BaseID, int Point_ID, 
    double Point_distance):base_ID(BaseID), 
    ID(Point_ID),distance(Point_distance){}
    closest():base_ID(-1), ID(-1),
    distance(std::numeric_limits<double>::max( )){}

    bool operator<(const closest& rhs) const
    {
        return distance < rhs.distance;
    }
};



void calc_nearest(void) 
{ 
    boost::heap::priority_queue<closest> svec;

    for (int current_gift = 0; current_gift < g_nVerticesPopulated; ++current_gift)
    {   double best_distance=std::numeric_limits<double>::max();    
        double our_distance=0.0;
        svec.clear();

        for (int all_other_gifts = 0; all_other_gifts < g_nVerticesPopulated;++all_other_gifts)
        {   
            our_distance = distanceVincenty(g_pVertices[current_gift].lat,g_pVertices[current_gift].lon,g_pVertices[all_other_gifts].lat,g_pVertices[all_other_gifts].lon);
            if (our_distance != 0.0)
            {
                if (our_distance < best_distance) // don't bother to push and sort if the calculated distance is greater than current 20th value
                    svec.push(closest(g_pVertices[current_gift].ID,g_pVertices[current_gift].ID,our_distance));

                if (all_other_gifts%100 == 0)
                {
                    while (svec.size() > no_of_closest_points_to_calculate) svec.pop(); // throw away any points above no_of_closest_points_to_calculate
                    closest t =  svec.top(); // the furthest of the no_of_closest_points_to_calculate points for optimisation
                    best_distance = t.distance;
                }
            }
        }
        std::cout << current_gift << "\n";
    }
}

As you can see. I have 100 000 lat & long points draw on an openGl sphere. I am calculating each point against every other point and only retaining currently the closest 20 points. There is some primitive optimisation going on by not pushing a value if it is bigger than the 20th closest point.

As I am used to Prolog taking hours to solve something I am not worried about speed. I shall run this overnight.

Thanks to all for your help. It is much appreciated. Still have to audit the code and results but happy that I am moving in the right direction.

like image 20
kingchris Avatar answered Nov 02 '22 22:11

kingchris


I actually have 100 000 Lat & Lon points drawn on a opengl sphere. I want to work out the 20 nearest points to each of the 100 000 points. So we have two loops to pick each point then calculate that point against every other point and save the closest 20 points.

This reads as if you want to perform a k-nearest neighbor search in the first place. For this, you usually use specialized data structures (e.g., a binary search tree) to speed up the queries (especially when you are doing 100k of them).

For spherical coordinates you'd have to do a conversion to a cartesian space to fix the coordinate wrap-around. Then you'd use an Octree or kD-Tree.

Here's an approach using the Fast Library for Approximate Nearest Neighbors (FLANN):

#include <vector>
#include <random>
#include <iostream>
#include <flann/flann.hpp>
#include <cmath>

struct Point3d {
    float x, y, z;
    void setLatLon(float lat_deg, float lon_deg) {
        static const float r = 6371.; // sphere radius
        float lat(lat_deg*M_PI/180.), lon(lon_deg*M_PI/180.);
        x = r * std::cos(lat) * std::cos(lon);
        y = r * std::cos(lat) * std::sin(lon);
        z = r * std::sin(lat);
    }
};

std::vector<Point3d> random_data(std::vector<Point3d>::size_type n) {
    static std::mt19937 gen{std::random_device()()};
    std::uniform_int_distribution<> dist(0, 36000);
    std::vector<Point3d> out(n);
    for (auto &i: out)
        i.setLatLon(dist(gen)/100., dist(gen)/100.);
    return out;
}

int main() {
    // generate random spherical point cloud
    std::vector<Point3d> data = random_data(1000);
    // generate query point(s) on sphere
    std::vector<Point3d> query = random_data(1);

    // convert into library datastructures
    auto mat_data = flann::Matrix<float>(&data[0].x, data.size(), 3);
    auto mat_query = flann::Matrix<float>(&query[0].x, query.size(), 3);
    // build KD-Tree-based index data structure
    flann::Index<flann::L2<float> > index(mat_data, flann::KDTreeIndexParams(4));
    index.buildIndex();
    // perform query: approximate nearest neighbor search
    int k = 5; // number of neighbors to find
    std::vector<std::vector<int>> k_indices;
    std::vector<std::vector<float>> k_dists;
    index.knnSearch(mat_query, k_indices, k_dists, k, flann::SearchParams(128));

    // k_indices now contains for each query point the indices to the neighbors in the original point cloud
    // k_dists now contains for each query point the distances to those neighbors
    // removed printing of results for brevity
}

You'd receive results similar to this one (click to enlarge):

flann result

For reference:

  • https://en.wikipedia.org/wiki/Nearest_neighbor_search
  • https://en.wikipedia.org/wiki/Octree
  • https://en.wikipedia.org/wiki/Kd-tree
  • http://www.cs.ubc.ca/research/flann/
like image 32
moooeeeep Avatar answered Nov 02 '22 21:11

moooeeeep


Heap is the data structure that you need. pre-C++11 stl only had functions which managed heap data in your own arrays. Someone mentioned that boost has a heap class, but you don't need to go so far as to use boost if your data is simple integers. stl's heap will do just fine. And, of course, the algorithm is to order the heap so that the highest value is the first one. So with each new value, you push it on the heap, and, once the heap reaches 21 elements in size, you pop the first value from the heap. This way whatever 20 values remain are always the 20 lowest.

like image 40
Dmitry Rubanovich Avatar answered Nov 02 '22 22:11

Dmitry Rubanovich