combine two arrays and sort

Tags:

python

numpy

Given two sorted arrays like the following:

a = array([1,2,4,5,6,8,9])

b = array([3,4,7,10])

I would like the output to be:

c = array([1,2,3,4,5,6,7,8,9,10])

or:

c = array([1,2,3,4,4,5,6,7,8,9,10])

I'm aware that I can do the following:

c = unique(concatenate((a,b)))

I'm just wondering if there is a faster way to do it as the arrays I'm dealing with have millions of elements.

Any ideas are welcome. Thanks.


asked Sep 14 '12 15:09

Jun

2 Answers

Since you use numpy, I doubt that bisect helps you at all. Instead I would suggest a few smaller things:

1. Do not use np.sort; use the c.sort() method instead, which sorts the array in place and avoids the copy.
2. np.unique has to sort internally and does so on a copy rather than in place. So instead of using np.unique, do the logic by hand: first sort in place, then do what np.unique does (check its Python source), i.e. flag = np.concatenate(([True], ar[1:] != ar[:-1])) followed by unique = ar[flag], with ar being sorted. To be a bit better, you should probably fill flag in place as well: flag = np.ones(len(ar), dtype=bool) and then np.not_equal(ar[1:], ar[:-1], out=flag[1:]), which avoids basically one full copy of flag.
3. I am not sure about this one, but .sort offers several algorithms through its kind argument. Since your arrays are probably almost sorted already, changing the sorting method might make a speed difference.

This would make the full thing close to what you got (without doing a unique beforehand):

import numpy as np

def insort(a, b, kind='mergesort'):
    # mergesort seemed a tiny bit faster in my test on a large, already sorted array
    c = np.concatenate((a, b))  # we still need this copy, unfortunately
    c.sort(kind=kind)           # in-place sort, no second copy
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])  # True where an element differs from its predecessor
    return c[flag]
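As a quick sanity check (a sketch, not part of the original answer), the function can be run on the arrays from the question and compared with np.unique. The function is repeated here so the snippet runs on its own. The loop over kind values only confirms that each sorting algorithm gives the same result; it makes no claim about which one is faster.

```python
import numpy as np

def insort(a, b, kind='mergesort'):
    c = np.concatenate((a, b))
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])

# Reference result from the approach in the question.
expected = np.unique(np.concatenate((a, b)))

# Every sort kind should produce the same merged, de-duplicated array.
for kind in ('quicksort', 'mergesort', 'heapsort'):
    assert np.array_equal(insort(a, b, kind=kind), expected)

print(insort(a, b).tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

To compare speed on your own data, wrapping the call in timeit with each kind would be the obvious next step.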


answered Sep 21 '22 13:09

seberg

Inserting elements into the middle of an array is a very inefficient operation: arrays are flat in memory, so everything after the insertion point has to be shifted along each time you insert another element. As a result, you probably don't want to use bisect. Inserting elements one at a time that way would cost around O(n^2) overall.
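To make the cost concrete, here is a minimal sketch of the element-by-element approach using the standard library's bisect.insort on a plain list (the variable names are only illustrative). Each call finds the position in O(log n), but then has to shift the tail of the list to make room, which is where the quadratic behaviour comes from.

```python
import bisect

a = [1, 2, 4, 5, 6, 8, 9]
b = [3, 4, 7, 10]

c = list(a)  # work on a copy so `a` is left untouched
for x in b:
    # O(log n) binary search, followed by an O(n) shift to make room for x
    bisect.insort(c, x)

print(c)  # [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
```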

Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.

Inserting all the elements into a hash table (such as a set) is another option. Removing the duplicates that way takes O(n) time, but then you still need to sort, which takes O(n*log(n)). Still not great.
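For illustration only, the set-based variant could look like the sketch below. It assumes the values are hashable scalars such as integers, and it converts the arrays to Python lists first, which is itself costly for arrays with millions of elements.

```python
import numpy as np

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])

# Building the set removes duplicates in O(n) on average, but a set is
# unordered, so the result still has to be sorted afterwards.
merged = set(a.tolist()) | set(b.tolist())
c = np.array(sorted(merged))

print(c.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```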

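The real O(n) solution involves allocating an output array and then populating it one element at a time by taking the smaller head of your two inputs, i.e. a merge. Unfortunately numpy does not seem to have such a thing. (Python's heapq.merge does merge sorted iterables, but it runs element by element in the interpreter, so it is likely to be slow for arrays of this size.) The solution may be to write one in Cython.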

It would look vaguely like the following:

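cimport numpy

# Assumes int64 arrays; out must be preallocated with len(in1) + len(in2) elements.
def merge(numpy.ndarray[numpy.int64_t, ndim=1] out,
          numpy.ndarray[numpy.int64_t, ndim=1] in1,
          numpy.ndarray[numpy.int64_t, ndim=1] in2):
    cdef Py_ssize_t n1 = in1.shape[0]
    cdef Py_ssize_t n2 = in2.shape[0]
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef Py_ssize_t k = 0
    while (i != n1) or (j != n2):
        # take from in1 if in2 is exhausted or in1 has the smaller head
        if j == n2 or (i != n1 and in1[i] <= in2[j]):
            out[k] = in1[i]
            i += 1
        else:
            out[k] = in2[j]
            j += 1
        k += 1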
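The merge loop can be checked in plain Python before compiling anything. The sketch below uses a hypothetical helper name, merge_sorted, and keeps duplicates (the second output form in the question). It is far too slow for millions of elements in the interpreter; it is only meant to verify the logic that a Cython version would run.

```python
import numpy as np

def merge_sorted(in1, in2):
    """Merge two sorted 1-D arrays into a new sorted array, keeping duplicates."""
    n1, n2 = len(in1), len(in2)
    out = np.empty(n1 + n2, dtype=np.result_type(in1, in2))
    i = j = k = 0
    while i != n1 or j != n2:
        # Take from in1 if in2 is exhausted or in1 has the smaller head.
        if j == n2 or (i != n1 and in1[i] <= in2[j]):
            out[k] = in1[i]
            i += 1
        else:
            out[k] = in2[j]
            j += 1
        k += 1
    return out

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])

print(merge_sorted(a, b).tolist())  # [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
```

If duplicates should be dropped, the flag trick from the other answer (comparing each element with its predecessor) can be applied to the merged array afterwards.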

answered Sep 18 '22 13:09

jleahy
