Assuming a sorted list of integers as below:
data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
# [1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 9, 9, 9]
I want to find the indices where the value changes, i.e.
[3, 8, 10, 13]
One approach is to use itertools.groupby:
from itertools import groupby

cursor = 0
result = []
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(result)
Output
[3, 8, 10, 13]
This approach is O(n). Another possible approach is to use bisect.bisect_left:
from bisect import bisect_left

cursor = 0
result = []
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)
Output
[3, 8, 10, 13]
This approach is O(k*log n) where k is the number of distinct elements. A variant of this approach is to use an exponential search.
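For reference, here is a rough sketch of that exponential-search variant; the doubling step and the helper name change_indices_exponential are just illustrative choices, not something I have benchmarked:

from bisect import bisect_left

def change_indices_exponential(data):
    cursor, n = 0, len(data)
    result = []
    while cursor < n:
        val = data[cursor]
        step = 1
        # gallop: double the step while we stay inside the current run of `val`
        while cursor + step < n and data[cursor + step] == val:
            step *= 2
        # the change lies in (cursor + step // 2, cursor + step]; bisect that window
        cursor = bisect_left(data, val + 1, cursor + step // 2, min(cursor + step, n))
        result.append(cursor)
    return result

print(change_indices_exponential(data))  # [3, 8, 10, 13]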
Is there any faster or more performant way of doing this?
When it comes to asymptotic complexity, I think you can improve slightly on the repeated binary search on average by applying a more evenly spread divide-and-conquer approach: first try to pinpoint the value change that occurs closest to the middle of the input list, thereby splitting the range into approximately two halves, which shortens the next binary search by about one step.
Yet, because this is Python, the gain might not be noticeable because of the Python-code overhead (yield, yield from, the recursion, ...). It might even perform worse for the list sizes you work with:
from bisect import bisect_left

def locate(data, start, end):
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid]
    if val == data[start]:
        start = mid
        val += 1
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)
    yield i
    yield from locate(data, i, end)

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate(data, 0, len(data)))  # 3 8 10
Note that this only outputs valid indices, so 13 is not included for this example input.
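If you also want the trailing len(data) from your expected output, a small wrapper of my own around the generator above would be:

result = [*locate(data, 0, len(data)), len(data)]
print(result)  # [3, 8, 10, 13]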
I tested the execution time of your approaches on two sets of data and added a third one using numpy:
import time
import numpy as np
from itertools import groupby
from bisect import bisect_left

data1 = [1] * 30000000 + [2] * 30000000 + [4] * 50000000 + [5] * 20000000 + [7] * 40000000 + [9] * 30000000 + [11] * 10000000 + [15] * 30000000
data2 = list(range(10000000))
data = data1  # or data2

cursor = 0
result = []
start_time = time.time()
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(f'groupby {time.time() - start_time} seconds')

cursor = 0
result = []
start_time = time.time()
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(f'bisect_left {time.time() - start_time} seconds')

data = np.array(data)
start_time = time.time()
result = [i + 1 for i in np.where(data[:-1] != data[1:])[0]] + [len(data)]
print(f'numpy {time.time() - start_time} seconds')
# We need to iterate over the results array to add 1 to each index for your expected results.
With data1
groupby 8.864859104156494 seconds
bisect_left 0.0 seconds
numpy 0.27180027961730957 seconds
With data2
groupby 3.602466583251953 seconds
bisect_left 5.440978765487671 seconds
numpy 2.2847368717193604 seconds
As you mentioned, bisect_left very much depends on the number of unique elements, but it seems using numpy gives better performance than itertools.groupby even with the additional iteration over the indices list.
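If you do the +1 adjustment and the final length on the numpy side as well, the Python-level list comprehension can be dropped entirely; a possible sketch (np.flatnonzero is just one way to write it):

import numpy as np

arr = np.asarray(data)
# positions where consecutive elements differ, shifted by one, plus the final length
changes = np.flatnonzero(arr[:-1] != arr[1:]) + 1
result = np.concatenate([changes, [len(arr)]]).tolist()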
Since you said "I'm more interested in runtime", here are some faster replacements for cursor += sum(1 for _ in group) in your groupby solution.
Using @Guy's data1 but with all segment lengths divided by 10:
             normal     optimized
original     870 ms     871 ms
zip_count    612 ms     611 ms
count_of     344 ms     345 ms
list_index   387 ms     386 ms
length_hint  223 ms     222 ms
Using list(range(1000000)) instead:
             normal     optimized
original     385 ms     331 ms
zip_count    463 ms     401 ms
count_of     197 ms     165 ms
list_index   175 ms     143 ms
length_hint  226 ms     127 ms
The optimized versions use more local variables or list comprehensions.
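For context, the accumulation trick the *_opti variants rely on is an assignment expression (walrus) inside a list comprehension; roughly like this, using the example data's run lengths:

cursor = 0
runs = [3, 5, 2, 3]  # run lengths of the example data
ends = [cursor := cursor + r for r in runs]
print(ends)  # [3, 8, 10, 13]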
I don't think __length_hint__ is guaranteed to be exact, not even for a list iterator, but it appears to be (it passes my correctness checks) and I don't see why it wouldn't.
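As a quick illustration (a CPython observation of mine, not a documented guarantee), a list iterator's __length_hint__ reports the number of remaining items:

it = iter([10, 20, 30, 40])
print(it.__length_hint__())  # 4
next(it)
print(it.__length_hint__())  # 3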
The code (Try it online!, but you'll have to reduce something to not exceed the time limit):
from timeit import default_timer as timer
from itertools import groupby, count
from collections import deque
from operator import countOf
def original(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += sum(1 for _ in group)
        result.append(cursor)
    return result

def original_opti(data):
    cursor = 0
    sum_ = sum
    return [cursor := cursor + sum_(1 for _ in group)
            for _, group in groupby(data)]

def zip_count(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        index = count()
        # deque(..., maxlen=0) drains the group; next(index) then gives its length
        deque(zip(group, index), 0)
        cursor += next(index)
        result.append(cursor)
    return result

def zip_count_opti(data):
    cursor = 0
    result = []
    append = result.append
    count_, deque_, zip_, next_ = count, deque, zip, next
    for key, group in groupby(data):
        index = count_()
        deque_(zip_(group, index), 0)
        cursor += next_(index)
        append(cursor)
    return result

def count_of(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += countOf(group, key)
        result.append(cursor)
    return result

def count_of_opti(data):
    cursor = 0
    countOf_ = countOf
    result = [cursor := cursor + countOf_(group, key)
              for key, group in groupby(data)]
    return result

def list_index(data):
    cursor = 0
    result = []
    for key, _ in groupby(data):
        cursor = data.index(key, cursor)
        result.append(cursor)
    del result[0]
    result.append(len(data))
    return result

def list_index_opti(data):
    cursor = 0
    index = data.index
    groups = groupby(data)
    next(groups, None)
    result = [cursor := index(key, cursor)
              for key, _ in groups]
    result.append(len(data))
    return result

def length_hint(data):
    result = []
    it = iter(data)
    for _ in groupby(it):
        result.append(len(data) - (1 + it.__length_hint__()))
    del result[0]
    result.append(len(data))
    return result

def length_hint_opti(data):
    it = iter(data)
    groups = groupby(it)
    next(groups, None)
    n_minus_1 = len(data) - 1
    length_hint = it.__length_hint__
    result = [n_minus_1 - length_hint()
              for _ in groups]
    result.append(len(data))
    return result

funcss = [
    (original, original_opti),
    (zip_count, zip_count_opti),
    (count_of, count_of_opti),
    (list_index, list_index_opti),
    (length_hint, length_hint_opti),
]

data1 = [1] * 3 + [2] * 3 + [4] * 5 + [5] * 2 + [7] * 4 + [9] * 3 + [11] * 1 + [15] * 3
data1 = [x for x in data1 for _ in range(1000000)]
data2 = list(range(1000000))

for _ in range(3):
    for name in 'data1', 'data2':
        print(name)
        data = eval(name)
        expect = None
        for funcs in funcss:
            print(f'{funcs[0].__name__:11}', end='')
            for func in funcs:
                times = []
                for _ in range(5):
                    start_time = timer()
                    result = func(data)
                    end_time = timer()
                    times.append(end_time - start_time)
                print(f'{round(min(times) * 1e3):5d} ms', end='')
                if expect is None:
                    expect = result
                else:
                    assert result == expect
            print()
        print()