Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

combine two arrays and sort

Tags:

python

numpy

Given two sorted arrays like the following:

a = array([1,2,4,5,6,8,9])

b = array([3,4,7,10])

I would like the output to be:

c = array([1,2,3,4,5,6,7,8,9,10])

or:

c = array([1,2,3,4,4,5,6,7,8,9,10])

I'm aware that I can do the following:

c = unique(concatenate((a,b))

I'm just wondering if there is a faster way to do it as the arrays I'm dealing with have millions of elements.

Any idea is welcomed. Thanks

like image 648
Jun Avatar asked Sep 14 '12 15:09

Jun


People also ask

How do you combine arrays together?

To merge elements from one array to another, we must first iterate(loop) through all the array elements. In the loop, we will retrieve each element from an array and insert(using the array push() method) to another array. Now, we can call the merge() function and pass two arrays as the arguments for merging.

Can you use merge sort on an array?

MergeSort AlgorithmTo sort an entire array, we need to call MergeSort(A, 0, length(A)-1) . As shown in the image below, the merge sort algorithm recursively divides the array into halves until we reach the base case of array with 1 element.


2 Answers

Since you use numpy, I doubt that bisec helps you at all... So instead I would suggest two smaller things:

  1. Do not use np.sort, use c.sort() method instead which sorts the array in place and avoids the copy.
  2. np.unique must use np.sort which is not in place. So instead of using np.unique do the logic by hand. IE. first sort (in-place) then do the np.unique method by hand (check also its python code), with flag = np.concatenate(([True], ar[1:] != ar[:-1])) with which unique = ar[flag] (with ar being sorted). To be a bit better, you should probably make the flag operation in place itself, ie. flag = np.ones(len(ar), dtype=bool) and then np.not_equal(ar[1:], ar[:-1], out=flag[1:]) which avoids basically one full copy of flag.
  3. I am not sure about this. But .sort has 3 different algorithms, since your arrays maybe are almost sorted already, changing the sorting method might make a speed difference.

This would make the full thing close to what you got (without doing a unique beforehand):

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b)) # we still need to do this unfortunatly.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
like image 142
seberg Avatar answered Sep 21 '22 13:09

seberg


Inserting elements into the middle of an array is a very inefficient operation as they're flat in memory, so you'll need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect. The complexity of doing so would be around O(N^2).

Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.

Inserting all the elements into a hash table (such as a set) is something. That's going to take O(N) time for uniquify, but then you need to sort which will take O(n*log(n)). Still not great.

The real O(N) solution involves allocated an array and then populating it one element at a time by taking the smallest head of your input lists, ie. a merge. Unfortunately neither numpy nor Python seem to have such a thing. The solution may be to write one in Cython.

It would look vaguely like the following:

def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1, 
        numpy.ndarray[int, ndim=1] in2):

        cdef int i = 0
        cdef int j = 0
        cdef int k = 0
        while (i!=len(in1)) or (j!=len(in2)):
            # set out[k] to smaller of in[i] or in[j]
            # increment k
            # increment one of i or j
like image 22
jleahy Avatar answered Sep 18 '22 13:09

jleahy