'in' for two sorted lists with the lowest complexity

Tags:

python

I have two sorted lists, e.g.

a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

I want to know for each item in a if it is in b. For the above example, I want to find

a_in_b = [True, True, False, False]

(or having the indices where a_in_b is True would be fine too).

Now, both a and b are very large, so complexity is an issue. With M = len(a) and N = len(b), how can I do this with a complexity lower than O(M * N) by making use of the fact that both lists are sorted?

963

asked Jan 19 '21 09:01

Tom de Geus


8 Answers

You can iterate over your b list manually within a loop over a. You'll want to advance the b iteration when the latest value you've seen from it is less than the current value from a.

from math import inf

result = []
b_iter = iter(b)                           # create an iterator over b
b_val = -inf
for a_val in a:
    while b_val < a_val:
        b_val = next(b_iter, inf)          # manually iterate on it
    result.append(a_val == b_val)

This should have a running time of O(M+N), since each list item gets iterated over at most once (b may not even need to be fully iterated).

You could avoid using floating point infinities if you want to, but you'd need to do a bit of extra work to handle some edge cases (e.g. if b is empty).
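
For what it's worth, one way the variant without floating point infinities might look is sketched below. This is only an illustration, not part of the answer above: the function name contains_sorted and the _DONE sentinel are made up here. The sentinel marks the end of b, so an empty b needs no special case and the items don't have to be comparable with floats.

```python
def contains_sorted(a, b):
    """For each item of sorted a, report whether it occurs in sorted b."""
    _DONE = object()                 # hypothetical sentinel meaning "b is exhausted"
    b_iter = iter(b)
    b_val = next(b_iter, _DONE)      # an empty b yields the sentinel right away
    result = []
    for a_val in a:
        # advance b while its current value is still smaller than a_val
        while b_val is not _DONE and b_val < a_val:
            b_val = next(b_iter, _DONE)
        result.append(b_val is not _DONE and b_val == a_val)
    return result


print(contains_sorted([1, 4, 7, 8], [1, 2, 3, 4, 5, 6]))
# [True, True, False, False]
```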

152

answered Nov 03 '22 23:11

Blckknght

Exploiting sortedness is a red herring for time complexity: the ideal case is to iterate both lists in lockstep for O(n+m) complexity, with n = len(a) and m = len(b). Converting b to a set for O(m), then searching the elements of a in the set for O(n), has the same complexity.

>>> a = [1, 4, 7, 8]
>>> b = [1, 2, 3, 4, 5, 6]
>>> bs = set(b)                 # create set for O(len(b))
>>> [item in bs for item in a]  # check O(len(a)) items "in set of b" for O(1) each
[True, True, False, False]

Since most of these operations are built in, the only costly operation is the iteration over a, which is needed in all solutions.

However, this will duplicate the references to the items in b. If b is treated as external to the algorithm, the space complexity is O(m+n) instead of the ideal case O(n) for just the answer.
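
To make the trade-off concrete, here is a rough sketch that puts the set lookup next to a lockstep scan and checks that they agree on random sorted inputs. The helper names in_set and in_lockstep are invented for this illustration. The lockstep version needs no extra storage besides the result, while the set version keeps a second container of size len(b).

```python
import random


def in_set(a, b):
    bs = set(b)                          # O(len(b)) time and extra space
    return [item in bs for item in a]    # O(1) average per lookup


def in_lockstep(a, b):
    result, j = [], 0
    for item in a:
        # j only ever moves forward, so b is scanned at most once overall
        while j < len(b) and b[j] < item:
            j += 1
        result.append(j < len(b) and b[j] == item)
    return result


random.seed(0)  # fixed seed so the check is repeatable
for _ in range(200):
    a = sorted(random.choices(range(50), k=random.randint(0, 20)))
    b = sorted(random.choices(range(50), k=random.randint(0, 20)))
    assert in_set(a, b) == in_lockstep(a, b)
```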

answered Nov 04 '22 00:11

MisterMiyagi

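Late answer, but here is a different approach to the problem, using set() uniqueness and the O(1) speed of len(), i.e.:

a_in_b = []
a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
b_set = set(b)
for v in a:
    l1 = len(b_set)
    b_set.add(v)                      # the set only grows if v was not in it yet
    a_in_b.append(l1 == len(b_set))
# Caveat: b_set also collects the items of a, so a value that is missing
# from b but occurs twice in a is reported as True the second time.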

Unfortunately, my approach isn't the fastest:

mistermiyagi: 0.387 ms
tomerikoo: 0.442 ms
blckknght: 0.729 ms
lobito: 1.043 ms
semisecure: 1.87 ms
notnotparas: too long
lucky6qi: too long

Benchmark

answered Nov 03 '22 23:11

Pedro Lobito

Use Binary Search here:

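def bs(b, aele, start, end):
    # recursive binary search for aele in the sorted list b[start:end + 1]
    if start > end:
        return False
    mid = (start + end) // 2
    if aele == b[mid]:
        return True

    if aele < b[mid]:
        return bs(b, aele, start, mid-1)
    else:
        return bs(b, aele, mid+1, end)

For each element x in a, check whether it exists in b using this method, i.e. bs(b, x, 0, len(b) - 1). Time complexity: O(m*log(n)), with m = len(a) and n = len(b).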
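
If you'd rather not write the search by hand, the standard library's bisect module can do the same job. The following is only a sketch of that alternative; the helper name in_sorted is made up here.

```python
from bisect import bisect_left


def in_sorted(b, x):
    """Return True if x occurs in the sorted list b (binary search)."""
    i = bisect_left(b, x)            # leftmost position where x could be inserted
    return i < len(b) and b[i] == x


a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
a_in_b = [in_sorted(b, x) for x in a]
print(a_in_b)  # [True, True, False, False]
```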

answered Nov 03 '22 23:11

notnotparas

Using sets the order doesn't even matter.

Turn b to a set (O(N)). Then iterate a (O(M)), and for each element check if it's in set_b (O(1)). This will give a time complexity of O(max(M, N)):

a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

set_b = set(b)
res = []
for elem in a:
    res.append(elem in set_b)

This can of course be shortened to a nifty list-comp:

res = [elem in set_b for elem in a]

Both give:

[True, True, False, False]

For your parenthesized request, simply iterate with enumerate instead:

res = []
for i, elem in enumerate(a):
    if elem in set_b:
        res.append(i)

Which will give [0, 1].
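
The index loop can presumably be condensed the same way as the boolean one. A small self-contained sketch, where the name indices is chosen here just for illustration:

```python
a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

set_b = set(b)
# keep the position of every item of a that is found in set_b
indices = [i for i, elem in enumerate(a) if elem in set_b]
print(indices)  # [0, 1]
```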

answered Nov 03 '22 22:11

Tomerikoo

You should use the binary search algorithm (read about it if you don't know what it is).

The modified bin_search function returns a position right for which b[right] >= elem, i.e. the first element in b that is greater than or equal to the searched element from a. This position is used as the left position for the next bin_search call. bin_search also returns True as the second item of its result if it has found elem in b.

def bin_search(arr, elem, left):
    right = len(arr)
    while left < right:
        mid = (left+right)//2
        if arr[mid] == elem:
            return (mid, True)
        if arr[mid] < elem:
            left = mid + 1
        else:
            right = mid
    return (right, False)

def find_a_in_b(a, b):
    new_left = 0
    a_in_b = [False] * len(a)
    
    # enumerate is lazy, so it costs no extra memory even for a large a
    for index, elem in enumerate(a):
        new_left, a_in_b[index] = bin_search(b, elem, new_left)
    return a_in_b

It's probably the best time complexity.

P.S. Forget it, I'm stupid and forgot about the linear algorithm used in merge sort, so it's not the best.
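
The idea of reusing the last position as the new left bound can also be expressed through the lo argument of bisect_left from the standard library. This is just a sketch of that idea; find_a_in_b_bisect is a hypothetical name, and the approach relies on a being sorted.

```python
from bisect import bisect_left


def find_a_in_b_bisect(a, b):
    a_in_b = []
    lo = 0                                # items of b before lo are smaller than the current item
    for elem in a:
        lo = bisect_left(b, elem, lo)     # search only b[lo:]
        a_in_b.append(lo < len(b) and b[lo] == elem)
    return a_in_b


print(find_a_in_b_bisect([1, 4, 7, 8], [1, 2, 3, 4, 5, 6]))
# [True, True, False, False]
```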

answered Nov 04 '22 00:11

Илья Кузнецов

Go through a and b once:

a_in_b = []
bstart = 0
for ai in a:
    # advance bstart past every item of b that is smaller than ai;
    # bstart never moves back, so b is traversed only once
    while bstart < len(b) and b[bstart] < ai:
        bstart += 1
    # ai is in b only if the first item that is not smaller equals it
    a_in_b.append(bstart < len(b) and b[bstart] == ai)

answered Nov 03 '22 22:11

lucky6qi

The obvious solution is actually O(M + N):

a = [1, 1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
c = [0] * len(a) # Or use a dict to stash hits ..

j = 0

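for i in range(0, len(a)):
  # skip items of b that are smaller than a[i]; j never moves back
  while j < len(b) and b[j] < a[i]:
    j += 1
  # the bounds check also covers an empty or exhausted b
  if j < len(b) and b[j] == a[i]:
    c[i] = 1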

print(c)

For each i in 0 ... M - 1, where M is the length of a, only a unique partition / sub-sequence of b plus one border number is checked, making it O(M + N) all together.
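
To see the O(M + N) argument in numbers, the scan can be instrumented to count how often j advances. This is only an illustrative sketch; count_steps and advances are names made up here, and the loop bounds are written so that an empty b is handled too. Since j never moves back, the count can never exceed len(b), while the outer loop runs len(a) times.

```python
def count_steps(a, b):
    """Two-pointer scan that also counts how often j advances."""
    c = [0] * len(a)
    j = 0
    advances = 0
    for i in range(len(a)):
        while j < len(b) and b[j] < a[i]:
            j += 1
            advances += 1            # each advance consumes one item of b for good
        if j < len(b) and b[j] == a[i]:
            c[i] = 1
    return c, advances


c, advances = count_steps([1, 1, 4, 7, 8], [1, 2, 3, 4, 5, 6])
print(c, advances)
```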

answered Nov 03 '22 23:11

spinkus
