
Find the indices where a sorted list of integers changes

Assume a sorted list of integers as below:

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
# [1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 9, 9, 9]

I want to find the indices where the value changes, i.e.

[3, 8, 10, 13]

One approach is to use itertools.groupby:

from itertools import groupby

cursor = 0
result = []
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(result)

Output

[3, 8, 10, 13]

This approach is O(n). Another possible approach is to use bisect.bisect_left:

from bisect import bisect_left

cursor = 0
result = []
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)

Output

[3, 8, 10, 13]

This approach is O(k*log n) where k is the number of distinct elements. A variant of this approach is to use an exponential search.
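The exponential-search variant can be sketched as follows (a sketch; `change_indices` is a name introduced here, not from the question). For each run, the probe step doubles ("gallops") until it overshoots the run, and the binary search is then confined to the last doubled window, so each step costs O(log m) in the run length m rather than O(log n):

```python
from bisect import bisect_left

def change_indices(data):
    # Indices where a sorted list's value changes, plus len(data),
    # found with exponential (galloping) search per run.
    result = []
    cursor = 0
    n = len(data)
    while cursor < n:
        target = data[cursor] + 1
        # Gallop: double the step until we pass the end of the current run.
        bound = 1
        while cursor + bound < n and data[cursor + bound] < target:
            bound *= 2
        # Binary search only inside the last doubled window.
        lo = cursor + bound // 2
        hi = min(cursor + bound, n)
        cursor = bisect_left(data, target, lo, hi)
        result.append(cursor)
    return result

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(change_indices(data))  # [3, 8, 10, 13]
```

For k runs of average length n/k this does roughly O(k log(n/k)) comparisons in total, which matches or beats the plain bisect_left loop.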

Is there any faster or more performant way of doing this?

Dani Mesejo asked Nov 23 '21


3 Answers

Regarding asymptotic complexity, I think you can improve slightly on the repeated binary search, on average, with a more evenly spread divide-and-conquer approach: first pinpoint the value change that occurs closest to the middle of the input list, splitting the range into roughly two halves; that shortens each subsequent binary-search path by about one step.

Yet, because this is Python, the gain might not be noticeable due to interpreter overhead (yield, yield from, recursion, ...). It might even perform worse for the list sizes you work with:

from bisect import bisect_left

def locate(data, start, end):
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid] 
    if val == data[start]:
        start = mid
        val += 1
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)
    yield i
    yield from locate(data, i, end)

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate(data, 0, len(data)))  # 3 8 10

Note that this only outputs valid indices, so 13 is not included for this example input.
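If the question's output format is wanted, the final boundary len(data) can simply be appended. A minimal wrapper (the generator is repeated verbatim so the snippet is self-contained; `change_indices` is a name introduced here):

```python
from bisect import bisect_left

def locate(data, start, end):
    # Same generator as above: yields change indices strictly inside [start, end).
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid]
    if val == data[start]:
        start = mid
        val += 1
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)
    yield i
    yield from locate(data, i, end)

def change_indices(data):
    # Append len(data) so the output matches the question's format.
    return [*locate(data, 0, len(data)), len(data)]

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(change_indices(data))  # [3, 8, 10, 13]
```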

trincot answered Oct 22 '22


I tested the execution time of your approaches on two sets of data, and added a third approach using numpy.

import time
from bisect import bisect_left
from itertools import groupby

import numpy as np

data1 = [1] * 30000000 + [2] * 30000000 + [4] * 50000000 + [5] * 20000000 + [7] * 40000000 + [9] * 30000000 + [11] * 10000000 + [15] * 30000000
data2 = list(range(10000000))
data = data1  # switch to data2 for the second run

cursor = 0
result = []
start_time = time.time()
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(f'groupby {time.time() - start_time} seconds')

cursor = 0
result = []
start_time = time.time()
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(f'bisect_left {time.time() - start_time} seconds')

data = np.array(data)
start_time = time.time()
[i + 1 for i in np.where(data[:-1] != data[1:])[0]] + [len(data)]
print(f'numpy {time.time() - start_time} seconds')

We need to iterate over the numpy result to add 1 to each index to get your expected output.

With data1

groupby 8.864859104156494 seconds
bisect_left 0.0 seconds
numpy 0.27180027961730957 seconds

With data2

groupby 3.602466583251953 seconds
bisect_left 5.440978765487671 seconds
numpy 2.2847368717193604 seconds

As you mentioned, bisect_left depends heavily on the number of unique elements, but it seems numpy performs better than itertools.groupby even with the additional iteration over the indices list.
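For reference, the numpy step can also stay fully vectorized: np.flatnonzero on the shifted inequality gives the change positions, and np.append adds the final boundary, avoiding the Python-level list comprehension (a sketch of the same idea, not a claim about its timing):

```python
import numpy as np

data = np.array([1] * 3 + [4] * 5 + [5] * 2 + [9] * 3)
# Positions where the value differs from its successor, shifted by one,
# plus the final boundary len(data); no Python-level loop is needed.
result = np.append(np.flatnonzero(data[:-1] != data[1:]) + 1, len(data))
print(result.tolist())  # [3, 8, 10, 13]
```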

Guy answered Oct 22 '22


Since you said "I'm more interested in runtime", here are some faster replacements for cursor += sum(1 for _ in group) in your groupby solution.

Using @Guy's data1 but all segment lengths divided by 10:

             normal  optimized
original     870 ms  871 ms
zip_count    612 ms  611 ms
count_of     344 ms  345 ms
list_index   387 ms  386 ms
length_hint  223 ms  222 ms

Using list(range(1000000)) instead:

             normal  optimized
original     385 ms  331 ms
zip_count    463 ms  401 ms
count_of     197 ms  165 ms
list_index   175 ms  143 ms
length_hint  226 ms  127 ms

The optimized versions use more local variables or list comprehensions.

I don't think __length_hint__ is guaranteed to be exact, not even for a list iterator, but it appears to be (passes my correctness checks) and I don't see why it wouldn't.
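A quick illustration of what the hint reports for a CPython list iterator (observed behavior, not a documented guarantee):

```python
data = [1] * 3 + [4] * 5
it = iter(data)
next(it)  # consume one element
# In CPython, a list iterator's __length_hint__ reports the exact number
# of elements remaining: 8 total minus the 1 consumed.
print(it.__length_hint__())  # 7
```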

The code (Try it online!, but you'll have to reduce something to not exceed the time limit):

from timeit import default_timer as timer
from itertools import groupby, count
from collections import deque
from operator import countOf

def original(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += sum(1 for _ in group)
        result.append(cursor)
    return result

def original_opti(data):
    cursor = 0
    sum_ = sum
    return [cursor := cursor + sum_(1 for _ in group)
            for _, group in groupby(data)]

def zip_count(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        index = count()
        deque(zip(group, index), 0)
        cursor += next(index)
        result.append(cursor)
    return result

def zip_count_opti(data):
    cursor = 0
    result = []
    append = result.append
    count_, deque_, zip_, next_ = count, deque, zip, next
    for key, group in groupby(data):
        index = count_()
        deque_(zip_(group, index), 0)
        cursor += next_(index)
        append(cursor)
    return result

def count_of(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += countOf(group, key)
        result.append(cursor)
    return result

def count_of_opti(data):
    cursor = 0
    countOf_ = countOf
    result = [cursor := cursor + countOf_(group, key)
              for key, group in groupby(data)]
    return result

def list_index(data):
    cursor = 0
    result = []
    for key, _ in groupby(data):
        cursor = data.index(key, cursor)
        result.append(cursor)
    del result[0]
    result.append(len(data))
    return result

def list_index_opti(data):
    cursor = 0
    index = data.index
    groups = groupby(data)
    next(groups, None)
    result = [cursor := index(key, cursor)
              for key, _ in groups]
    result.append(len(data))
    return result

def length_hint(data):
    result = []
    it = iter(data)
    for _ in groupby(it):
        result.append(len(data) - (1 + it.__length_hint__()))
    del result[0]
    result.append(len(data))
    return result

def length_hint_opti(data):
    it = iter(data)
    groups = groupby(it)
    next(groups, None)
    n_minus_1 = len(data) - 1
    length_hint = it.__length_hint__
    result = [n_minus_1 - length_hint()
              for _ in groups]
    result.append(len(data))
    return result

funcss = [
    (original, original_opti),
    (zip_count, zip_count_opti),
    (count_of, count_of_opti),
    (list_index, list_index_opti),
    (length_hint, length_hint_opti),
]

data1 = [1] * 3 + [2] * 3 + [4] * 5 + [5] * 2 + [7] * 4 + [9] * 3 + [11] * 1 + [15] * 3
data1 = [x for x in data1 for _ in range(1000000)]
data2 = list(range(1000000))

for _ in range(3):
    for name in 'data1', 'data2':
        print(name)
        data = eval(name)
        expect = None
        for funcs in funcss:
            print(f'{funcs[0].__name__:11}', end='')
            for func in funcs:
                times = []
                for _ in range(5):
                    start_time = timer()
                    result = func(data)
                    end_time = timer()
                    times.append(end_time - start_time)
                print(f'{round(min(times) * 1e3):5d} ms', end='')
                if expect is None:
                    expect = result
                else:
                    assert result == expect
            print()
        print()

no comment answered Oct 22 '22