Most efficient way to find index of matching values in two sorted arrays using C++

Tags: c++, arrays, loops, algorithm, matching

I have a working solution to this problem, but I suspect it is not as efficient as it could be, so I want to see if there is a faster method.

I have two arrays (std::vectors, for example). Both contain only unique integer values that are sorted but sparse, e.g. 1,4,12,13... Is there a fast way to find the INDEX in either array at which the values are the same? For example, array1 has values 1,4,12,13 and array2 has values 2,12,14,16. The first matching value, 12, is at index 1 in array2 (and index 2 in array1). The index is what matters, because I have other arrays holding data that is looked up with this "matching" index.

I am not confined to using arrays; maps are possible too. I only compare the two arrays once; they are not reused after the first matching pass. Either array can hold anywhere from a few values to a large number (300,000+), but the two DO NOT always have the same number of values (that would make things much easier).

The worst case is a naive linear search of one array for each element of the other, O(N^2). Using a map would get me better O(log N) lookups, but I would still have to convert an array into a map of (value, index) pairs.
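A minimal sketch of the map idea, assuming std::unordered_map (expected O(1) lookups) rather than std::map (O(log N) lookups). The name match_with_map is hypothetical and only used for illustration; the cost of building the map is the conversion overhead mentioned above.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns (index in a, index in b) for every shared value.
// Does not require the inputs to be sorted, only that values in b are unique.
std::vector<std::pair<std::size_t, std::size_t>>
match_with_map(const std::vector<int>& a, const std::vector<int>& b) {
    // Build value -> index for b: O(|b|) expected time plus extra memory.
    std::unordered_map<int, std::size_t> index_of;
    index_of.reserve(b.size());
    for (std::size_t j = 0; j < b.size(); ++j)
        index_of.emplace(b[j], j);

    // Look up every element of a: O(|a|) expected time.
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < a.size(); ++i) {
        auto it = index_of.find(a[i]);
        if (it != index_of.end())
            matches.emplace_back(i, it->second);
    }
    return matches;
}
```

Since the arrays are only compared once, building the map is unlikely to pay off against a single linear pass over two already sorted arrays; the sketch mainly shows what the conversion would involve.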

What I currently have, which avoids any container conversion, is this. Loop over the smaller of the two arrays. Compare the current element of the small array (array1) with the current element of the large array (array2). While the array1 element is larger than the array2 element, increment the index into array2 (while loop). Then, if the array1 element is smaller than the array2 element, go to the next loop iteration and begin again. Otherwise they must be equal, and I have the index of the matching value in both arrays.

So this loop is at best O(N) if all values have matches and at worst O(2N) if none match, which is still linear. So I am wondering if there is something faster out there. It's hard to know for sure how often the two arrays will match, but I would say that most of the arrays will mostly have matches.

I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.

Code example:

std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34,40};

for (std::size_t i = 0, z = 0; i < array1.size(); i++)
{
    int value1 = array1[i];
    // check the bound first so array2[z] is never read past the end
    while (z < array2.size() && value1 > array2[z])
        z++;

    if (z >= array2.size())
        break; // reached end of array2

    if (value1 < array2[z])
        continue; // no match for this element of array1

    // we have a match: array1[i] == array2[z]
}

The result will be matching indexes [1,3] for array1 and [2,3] for array2.
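For comparison, when one array is much smaller than the other, one possible variant binary-searches the unvisited tail of the larger array for each element of the smaller one. This is a sketch under that assumption, not part of the loop above; match_with_lower_bound is a hypothetical name, and the cost is roughly O(M log N) for M small and N large values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns (index in small, index in large) per shared value.
// Precondition: both inputs are sorted in ascending order.
std::vector<std::pair<std::size_t, std::size_t>>
match_with_lower_bound(const std::vector<int>& small,
                       const std::vector<int>& large) {
    assert(std::is_sorted(small.begin(), small.end()));
    assert(std::is_sorted(large.begin(), large.end()));

    std::vector<std::pair<std::size_t, std::size_t>> matches;
    auto from = large.begin(); // everything before `from` is already ruled out
    for (std::size_t i = 0; i < small.size() && from != large.end(); ++i) {
        // first element of large that is >= small[i], searching only the tail
        from = std::lower_bound(from, large.end(), small[i]);
        if (from != large.end() && *from == small[i])
            matches.emplace_back(
                i, static_cast<std::size_t>(from - large.begin()));
    }
    return matches;
}
```

When the two arrays are of similar size and mostly match, the plain linear scan is likely to be at least as fast, because each binary search then only skips a handful of elements.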


asked Apr 16 '16 20:04

scottiedoo

1 Answer

I wrote an implementation of this function using an algorithm that performs better than the trivial linear merge on sparse distributions.

For distributions that are similar†, it has O(n) complexity, but for ranges where the distributions differ greatly it should perform below linear, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case is better than O(n log n). On the other hand, I haven't been able to find an input that triggers that worst case either.

I templated it so that any type of range can be used, such as sub-ranges or raw arrays. Technically it works with non-random-access iterators as well, but the complexity is much greater, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.

† By similar distribution, I mean that the pair of arrays has many crossings. By crossing, I mean a point where you would switch from one array to the other if you were to merge the two arrays together in sorted order.
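To make the notion of a crossing concrete, here is a small sketch that counts them with an ordinary merge walk. The name count_crossings and the tie-breaking rule (equal values are taken from the first array first) are assumptions made for illustration; they are not part of the implementation below.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how often a sorted merge of a and b switches
// from taking elements of one array to taking elements of the other.
// Precondition: both inputs are sorted in ascending order.
std::size_t count_crossings(const std::vector<int>& a,
                            const std::vector<int>& b) {
    std::size_t i = 0, j = 0, crossings = 0;
    int last = 0; // 0 = nothing taken yet, 1 = last taken from a, 2 = from b
    while (i < a.size() && j < b.size()) {
        int source = (a[i] <= b[j]) ? 1 : 2; // ties are taken from a first
        if (last != 0 && source != last)
            ++crossings;
        last = source;
        if (source == 1) ++i; else ++j;
    }
    // One array is exhausted; the rest of the merge comes from the other one.
    if ((i < a.size() && last == 2) || (j < b.size() && last == 1))
        ++crossings;
    return crossings;
}
```

With this definition, {1,3,5} and {2,4,6} have five crossings (a "similar" pair), while {1,2,3,4,5} and {100} have only one, which is the kind of input where skipping ahead can beat a linear merge.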

#include <algorithm>
#include <iterator>
#include <utility>

// helper structure for the search
template<class Range, class Out>
struct search_data {
    // is there any clearer way to get an iterator that might be either
    // a Range::const_iterator or const T*?
    using iterator = decltype(std::cbegin(std::declval<Range&>()));
    iterator curr;
    const iterator begin, end;
    Out out;
};

template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
    return search_data<Range, Out>{
        std::begin(range),
        std::begin(range),
        std::end(range),
        out,
    };
}

template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
    auto search_data1 = init_search_data(in1, out1);
    auto search_data2 = init_search_data(in2, out2);

    // initial order is arbitrary
    auto lesser = &search_data1;
    auto greater = &search_data2;

    // if either range is exhausted, we are finished
    while(lesser->curr != lesser->end
            && greater->curr != greater->end) {
        // difference of first values in each range
        auto delta = *greater->curr - *lesser->curr;

        if(!delta) { // matching value was found
            // store both results and increment the iterators
            *lesser->out++ = std::distance(lesser->begin, lesser->curr++);
            *greater->out++ = std::distance(greater->begin, greater->curr++);
            continue; // then start a new iteration
        }

        if(delta < 0) { // set the order of ranges by their first value
            std::swap(lesser, greater);
            delta = -delta; // delta is always positive after this
        }

        // next crossing cannot be farther away than delta elements
        // this assumption has the following prerequisites:
        // range is sorted, values are integers, values in the range are unique
        auto range_left = std::distance(lesser->curr, lesser->end);
        auto upper_limit =
            std::min(range_left, static_cast<decltype(range_left)>(delta));

        // exponential search for a sub range where the value at upper bound
        // is not less than target, and value at lower bound is lesser
        auto target = *greater->curr;
        auto first = lesser->curr;
        auto lower = first;
        auto upper = std::next(first, upper_limit);
        // probe offsets 1, 2, 4, ... measured from first (not from lower),
        // so that a probe never lands at or beyond upper
        for(decltype(upper_limit) i = 1; i < upper_limit; i *= 2) {
            auto guess = std::next(first, i);
            if(*guess >= target) {
                upper = guess;
                break;
            }
            lower = guess;
        }

        // skip all values in lesser,
        // that are less than the least value in greater
        lesser->curr = std::lower_bound(lower, upper, target);
    }
}

#include <iostream>
#include <vector>

int main() {
    std::vector<int> array1 = {4,6,12,34};
    std::vector<int> array2 = {1,3,6,34};

    std::vector<std::size_t> indices1;
    std::vector<std::size_t> indices2;

    match_indices(array1, array2,
                  std::back_inserter(indices1),
                  std::back_inserter(indices2));

    std::cout << "indices in array1: ";
    for(std::vector<int>::size_type i : indices1)
        std::cout << i << ' ';

    std::cout << "\nindices in array2: ";
    for(std::vector<int>::size_type i : indices2)
        std::cout << i << ' ';
    std::cout << std::endl;
}
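The exponential search step inside match_indices can also be looked at in isolation. The following is a minimal sketch, assuming random-access iterators and a sorted range; gallop_lower_bound is a hypothetical name, and unlike match_indices it does not use the delta bound, so it works for any sorted values.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical helper: returns the same iterator as
// std::lower_bound(first, last, target), but first probes offsets
// 1, 2, 4, ... from `first` to narrow the range before binary-searching.
// The cost is O(log d), where d is the distance from first to the result.
template <class It, class T>
It gallop_lower_bound(It first, It last, const T& target) {
    auto size = std::distance(first, last);
    decltype(size) lo = 0;   // the result is never before offset lo
    decltype(size) step = 1; // next offset to probe
    while (step < size && *std::next(first, step) < target) {
        lo = step + 1; // element at `step` is too small, result lies after it
        step *= 2;
    }
    // Either the element at `step` is >= target, or step ran past the end.
    auto hi = std::min(step, size);
    return std::lower_bound(std::next(first, lo), std::next(first, hi), target);
}
```

For example, calling it on a std::vector<int> holding {1,3,6,34,40} with target 6 returns an iterator to index 2 after probing only offsets 1 and 2.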

answered Sep 20 '22 17:09

eerorika
