Assuming a sorted list of integers as below:
data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
# [1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 9, 9, 9]
I want to find the indices where the value changes, i.e.
[3, 8, 10, 13]
One approach is to use itertools.groupby:
from itertools import groupby

cursor = 0
result = []
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(result)
Output
[3, 8, 10, 13]
This approach is O(n). Another possible approach is to use bisect.bisect_left:
from bisect import bisect_left

cursor = 0
result = []
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)
Output
[3, 8, 10, 13]
This approach is O(k*log n) where k is the number of distinct elements. A variant of this approach is to use an exponential search.
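For reference, here is a rough sketch of that exponential-search variant; the doubling step and the helper name change_indices_exponential are just illustrative choices, not something I have benchmarked:

from bisect import bisect_left

def change_indices_exponential(data):
    cursor, n = 0, len(data)
    result = []
    while cursor < n:
        val = data[cursor]
        step = 1
        # gallop: double the step while we stay inside the current run of `val`
        while cursor + step < n and data[cursor + step] == val:
            step *= 2
        # the change lies in (cursor + step // 2, cursor + step]; bisect that window
        cursor = bisect_left(data, val + 1, cursor + step // 2, min(cursor + step, n))
        result.append(cursor)
    return result

print(change_indices_exponential(data))  # [3, 8, 10, 13]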
Is there any faster or more performant way of doing this?
When it comes to asymptotic complexity, I think you can improve slightly on the repeated binary search on average by applying a more evenly spread divide-and-conquer approach: first try to pinpoint the value change that occurs closest to the middle of the input list, thereby splitting the range into approximately two halves, which shortens the next binary search by about one step.
Yet, because this is Python, the gain might not be noticeable because of the Python-code overhead (yield, yield from, the recursion, ...). It might even perform worse for the list sizes you work with:
from bisect import bisect_left

def locate(data, start, end):
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid]
    if val == data[start]:
        start = mid
        val += 1
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)
    yield i
    yield from locate(data, i, end)

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate(data, 0, len(data)))  # 3 8 10
Note that this only outputs valid indices, so 13 is not included for this example input.
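If you also want the trailing len(data) from your expected output, a small wrapper of my own around the generator above would be:

result = [*locate(data, 0, len(data)), len(data)]
print(result)  # [3, 8, 10, 13]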
I tested the execution time of your approaches on two sets of data and added a third one using numpy:
import time
import numpy as np
from itertools import groupby
from bisect import bisect_left

data1 = [1] * 30000000 + [2] * 30000000 + [4] * 50000000 + [5] * 20000000 + [7] * 40000000 + [9] * 30000000 + [11] * 10000000 + [15] * 30000000
data2 = list(range(10000000))
data = data1  # or data2

cursor = 0
result = []
start_time = time.time()
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(f'groupby {time.time() - start_time} seconds')

cursor = 0
result = []
start_time = time.time()
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(f'bisect_left {time.time() - start_time} seconds')

data = np.array(data)
start_time = time.time()
result = [i + 1 for i in np.where(data[:-1] != data[1:])[0]] + [len(data)]
print(f'numpy {time.time() - start_time} seconds')
# We need to iterate over the results array to add 1 to each index for your expected results.
With data1
groupby 8.864859104156494 seconds
bisect_left 0.0 seconds
numpy 0.27180027961730957 seconds
With data2
groupby 3.602466583251953 seconds
bisect_left 5.440978765487671 seconds
numpy 2.2847368717193604 seconds
As you mentioned, bisect_left very much depends on the number of unique elements, but it seems using numpy gives better performance than itertools.groupby even with the additional iteration over the indices list.
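If you do the +1 adjustment and the final length on the numpy side as well, the Python-level list comprehension can be dropped entirely; a possible sketch (np.flatnonzero is just one way to write it):

import numpy as np

arr = np.asarray(data)
# positions where consecutive elements differ, shifted by one, plus the final length
changes = np.flatnonzero(arr[:-1] != arr[1:]) + 1
result = np.concatenate([changes, [len(arr)]]).tolist()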
Since you said "I'm more interested in runtime", here are some faster replacements for cursor += sum(1 for _ in group) in your groupby solution.
Using @Guy's data1 but with all segment lengths divided by 10:
             normal     optimized
original     870 ms     871 ms
zip_count    612 ms     611 ms
count_of     344 ms     345 ms
list_index   387 ms     386 ms
length_hint  223 ms     222 ms
Using list(range(1000000)) instead:
             normal     optimized
original     385 ms     331 ms
zip_count    463 ms     401 ms
count_of     197 ms     165 ms
list_index   175 ms     143 ms
length_hint  226 ms     127 ms
The optimized versions use more local variables or list comprehensions.
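For context, the accumulation trick the *_opti variants rely on is an assignment expression (walrus) inside a list comprehension; roughly like this, using the example data's run lengths:

cursor = 0
runs = [3, 5, 2, 3]  # run lengths of the example data
ends = [cursor := cursor + r for r in runs]
print(ends)  # [3, 8, 10, 13]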
I don't think __length_hint__ is guaranteed to be exact, not even for a list iterator, but it appears to be (it passes my correctness checks) and I don't see why it wouldn't.
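As a quick illustration (a CPython observation of mine, not a documented guarantee), a list iterator's __length_hint__ reports the number of remaining items:

it = iter([10, 20, 30, 40])
print(it.__length_hint__())  # 4
next(it)
print(it.__length_hint__())  # 3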
The code (Try it online!, but you'll have to reduce something to not exceed the time limit):
from timeit import default_timer as timer
from itertools import groupby, count
from collections import deque
from operator import countOf
def original(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += sum(1 for _ in group)
        result.append(cursor)
    return result

def original_opti(data):
    cursor = 0
    sum_ = sum
    return [cursor := cursor + sum_(1 for _ in group)
            for _, group in groupby(data)]

def zip_count(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        index = count()
        # deque(..., maxlen=0) drains the group; next(index) then gives its length
        deque(zip(group, index), 0)
        cursor += next(index)
        result.append(cursor)
    return result

def zip_count_opti(data):
    cursor = 0
    result = []
    append = result.append
    count_, deque_, zip_, next_ = count, deque, zip, next
    for key, group in groupby(data):
        index = count_()
        deque_(zip_(group, index), 0)
        cursor += next_(index)
        append(cursor)
    return result

def count_of(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += countOf(group, key)
        result.append(cursor)
    return result

def count_of_opti(data):
    cursor = 0
    countOf_ = countOf
    result = [cursor := cursor + countOf_(group, key)
              for key, group in groupby(data)]
    return result

def list_index(data):
    cursor = 0
    result = []
    for key, _ in groupby(data):
        cursor = data.index(key, cursor)
        result.append(cursor)
    del result[0]
    result.append(len(data))
    return result

def list_index_opti(data):
    cursor = 0
    index = data.index
    groups = groupby(data)
    next(groups, None)
    result = [cursor := index(key, cursor)
              for key, _ in groups]
    result.append(len(data))
    return result

def length_hint(data):
    result = []
    it = iter(data)
    for _ in groupby(it):
        result.append(len(data) - (1 + it.__length_hint__()))
    del result[0]
    result.append(len(data))
    return result

def length_hint_opti(data):
    it = iter(data)
    groups = groupby(it)
    next(groups, None)
    n_minus_1 = len(data) - 1
    length_hint = it.__length_hint__
    result = [n_minus_1 - length_hint()
              for _ in groups]
    result.append(len(data))
    return result

funcss = [
    (original, original_opti),
    (zip_count, zip_count_opti),
    (count_of, count_of_opti),
    (list_index, list_index_opti),
    (length_hint, length_hint_opti),
]

data1 = [1] * 3 + [2] * 3 + [4] * 5 + [5] * 2 + [7] * 4 + [9] * 3 + [11] * 1 + [15] * 3
data1 = [x for x in data1 for _ in range(1000000)]
data2 = list(range(1000000))

for _ in range(3):
    for name in 'data1', 'data2':
        print(name)
        data = eval(name)
        expect = None
        for funcs in funcss:
            print(f'{funcs[0].__name__:11}', end='')
            for func in funcs:
                times = []
                for _ in range(5):
                    start_time = timer()
                    result = func(data)
                    end_time = timer()
                    times.append(end_time - start_time)
                print(f'{round(min(times) * 1e3):5d} ms', end='')
                if expect is None:
                    expect = result
                else:
                    assert result == expect
            print()
        print()