Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Suggestions for Sparse Vector implementation without using outside libraries

I have to use a sparse vector for some things in my current project. But, since I am not in charge of the project, I can not use whatever outside libraries I want. I only have STL and OpenCV available.

I have looked through several stackoverflow answered questions, but they either focus on a specific approach, comparison of a limited number of approaches (2) and outside libraries when they deal specifically with sparse vectors. There is also some excellent ideas for implementing a sparse matrix.

What I want is specifically a sparse vector (the indexes will always be in 1 dimension, the data is not relevant for this question). I would like something that will not be a project on it's own to implement, but could be used for more-than-demonstrative-purposes (e.g. I want to achieve a decent speed and not too much of a memory overhead) and hopefully re-used later. The options I have considered included:

  • adapting the OpenCV implementation of SparseMat to my needs
  • using a std::map to store the values (or maybe making a very simple wrapper that would return a default value in case of indexing a zero-element)
  • using a std::vector< std::pair < int , data_type > > where I could store the index and the data in the std::pair elements

Is any of these solutions better/worse for the general purpose use as a sparse vector? I know every approach to everything has it's ups and downs, but argumented suggestions on which approach to choose would be very much appreciated. Also, recommending an approach I have not considered would be more than welcome if anyone thinks he has a better suggestion.


The usage in my specific case is as follows:

  • the vector will most likely not be modified after it is created (right now I don't see any need for this, but I can not guarantee 100% that it will not come up)
  • a most common operation is anticipated to be dot product of two such vectors (so, more-or-less accessing the elements in a linear sequential way)
  • only lookup I can foresee right now is (maybe) checking weather a certain element is a zero-element or not
  • there will be around 500 non-zero elements expected
  • in short, most of the time the sparse vector would be used as the mathematical concept of a vector (a multi-dimensional point) without much need to examine each coordinate separately

Still, as I have written in my original question, I would like suggestions for a general-purpose Sparse Vector implementation.

like image 426
penelope Avatar asked May 29 '12 09:05

penelope


2 Answers

I believe std::map would give you the best results. SpareseMat, I don't know, but among the two other methods you mentioned, std::map will give you O(log(n)) lookup as well as O(log(n)) insertion and deletion. The vector however requires a search over all its data (so it has O(n) lookup). It has O(1) insertion, but O(n) removal. My guess is that you will have a lot of lookup, so most probably std::map is better for you.

Depending on your application, you might want to use the vector method during initial creation of the structure and then convert it to map once you start using it to get the best of both worlds (but often this is not the case, for example in the cases where you have repeating indices).

Besides hash which should give you O(1) everything, but in reality might not, the lookup of O(log(n)) is the best you can hope for. You may come up with a vector that you can binary search on, or any other method based on searching for the data by comparison, but in the end they would all be O(log(n)) so you might as well use the easy-already-done std::map.


Update: Based on the update on your question, that indicates that the vector will most likely not be modified after it is created and a most common operation is anticipated to be dot product, I would suggest the following:

First, use a vector of pairs as you have suggested yourself. During creation, simply push_back and you get O(1) performance.1 After that, you can sort the vector. The dot product would be quite simple2:

int dot = 0;
unsigned int index_v1 = 0, index_v2 = 0;
while (index_v1 < v1.size() && index_v2 < v2.size())
    if (v1[index_v1].first == v2[index_v2].first)
        dot += v1[index_v1++].second * v2[index_v2++].second;
    else if (v1[index_v1].first < v2[index_v2].first)
        ++index_v1;
    else
        ++index_v2;

Checking whether a certain element is a zero-element or not would be a simple binary search, checking whether the element could be found or not (O(log(n)) performance).

Given the fact that you would be using this structure as a point, I believe keeping it a vector would be better. You might want to later do cross product or other geometric operations.

Regarding the fact that you may need to insert something in the vector every now and then, then you would have to insert it in place (so the vector remains sorted). The performance would be O(n), but since it doesn't happen often, it shouldn't be an issue.

1 Unless you have millions of these vectors, O(1) and O(log(n)) for n ~= 500 should not really make any noticeable difference.

2 You could definitely also use a map and use iterators over it to do the dot product in order of index. The performance will be the same if std::map uses a threaded tree that lets you get to the next node in O(1).

like image 88
Shahbaz Avatar answered Sep 28 '22 05:09

Shahbaz


KISS

Start with the simplest solution, that is wrap std::map<size_t, YourData> into a specific sparse vector structure:

template <typename V>
class SparseVector {
public:

private:
    std::map<size_t, V> specificValues;
    V defaultValue;
};

The map will give good default performances in all use cases (find/insert/update) and the use of a custom class means that if you ever need to change to another implementation, you will not have to change the clients sources, just recompile.

like image 34
Matthieu M. Avatar answered Sep 28 '22 06:09

Matthieu M.