Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's a fast and pythonic/clean way of removing a sorted list from another sorted list in python?

I am creating a fast method of generating a list of primes in the range(0, limit+1). In the function I end up removing all integers in the list named removable from the list named primes. I am looking for a fast and pythonic way of removing the integers, knowing that both lists are always sorted.

I might be wrong, but I believe list.remove(n) iterates over the list comparing each element with n. meaning that the following code runs in O(n^2) time.

# removable and primes are both sorted lists of integers
for composite in removable:
    primes.remove(composite)

Based off my assumption (which could be wrong and please confirm whether or not this is correct) and the fact that both lists are always sorted, I would think that the following code runs faster, since it only loops over the list once for a O(n) time. However, it is not at all pythonic or clean.

i = 0
j = 0
while i < len(primes) and j < len(removable):
    if primes[i] == removable[j]:
        primes = primes[:i] + primes[i+1:]
        j += 1
    else:
        i += 1

Is there perhaps a built in function or simpler way of doing this? And what is the fastest way?

Side notes: I have not actually timed the functions or code above. Also, it doesn't matter if the list removable is changed/destroyed in the process.

For anyone interested the full functions is below:

import math

# returns a list of primes in range(0, limit+1)
def fastPrimeList(limit):
    if limit < 2:
        return list()
    sqrtLimit = int(math.ceil(math.sqrt(limit)))
    primes = [2] + range(3, limit+1, 2)
    index = 1
    while primes[index] <= sqrtLimit:
        removable = list()
        index2 = index
        while primes[index] * primes[index2] <= limit:
            composite = primes[index] * primes[index2]
            removable.append(composite)
            index2 += 1
        for composite in removable:
            primes.remove(composite)
        index += 1
    return primes
like image 376
David C Avatar asked Sep 23 '14 22:09

David C


2 Answers

This is quite fast and clean, it does O(n) set membership checks, and in amortized time it runs in O(n) (first line is O(n) amortized, second line is O(n * 1) amortized, because a membership check is O(1) amortized):

removable_set = set(removable)
primes = [p for p in primes if p not in removable_set]

Here is the modification of your 2nd solution. It does O(n) basic operations (worst case):

tmp = []
i = j = 0
while i < len(primes) and j < len(removable):
    if primes[i] < removable[j]:
        tmp.append(primes[i])
        i += 1
    elif primes[i] == removable[j]:
        i += 1
    else:
        j += 1
primes[:i] = tmp
del tmp

Please note that constants also matter. The Python interpreter is quite slow (i.e. with a large constant) to execute Python code. The 2nd solution has lots of Python code, and it can indeed be slower for small practical values of n than the solution with sets, because the set operations are implemented in C, thus they are fast (i.e. with a small constant).

If you have multiple working solutions, run them on typical input sizes, and measure the time. You may get surprised about their relative speed, often it is not what you would predict.

like image 80
pts Avatar answered Oct 24 '22 01:10

pts


The most important thing here is to remove the quadratic behavior. You have this for two reasons.

First, calling remove searches the entire list for values to remove. Doing this takes linear time, and you're doing it once for each element in removable, so your total time is O(NM) (where N is the length of primes and M is the length of removable).

Second, removing elements from the middle of a list forces you to shift the whole rest of the list up one slot. So, each one takes linear time, and again you're doing it M times, so again it's O(NM).


How can you avoid these?

For the first, you either need to take advantage of the sorting, or just use something that allows you to do constant-time lookups instead of linear-time, like a set.

For the second, you either need to create a list of indices to delete and then do a second pass to move each element up the appropriate number of indices all at once, or just build a new list instead of trying to mutate the original in-place.

So, there are a variety of options here. Which one is best? It almost certainly doesn't matter; changing your O(NM) time to just O(N+M) will probably be more than enough of an optimization that you're happy with the results. But if you need to squeeze out more performance, then you'll have to implement all of them and test them on realistic data.

The only one of these that I think isn't obvious is how to "use the sorting". The idea is to use the same kind of staggered-zip iteration that you'd use in a merge sort, like this:

def sorted_subtract(seq1, seq2):
    i1, i2 = 0, 0
    while i1 < len(seq1):
        if seq1[i1] != seq2[i2]:
            i2 += 1
            if i2 == len(seq2):
                yield from seq1[i1:]
                return
        else:
            yield seq1[i1]
            i1 += 1
like image 27
abarnert Avatar answered Oct 23 '22 23:10

abarnert