I'm trying to get the indices of all repeated elements in a numpy array, but the solution I've found so far is REALLY inefficient for a large (>20000 elements) input array (it takes roughly 9 seconds). The idea is simple:
- `records_array` is a numpy array of timestamps (`datetime`) from which we want to extract the indices of the repeated timestamps
- `time_array` is a numpy array containing all the timestamps that are repeated in `records_array`
- `records` is a Django QuerySet (which can easily be converted to a list) containing some Record objects. We want to create a list of couples formed by all possible combinations of the `tagId` attributes of the Record objects corresponding to the repeated timestamps found in `records_array`.
Here is the working (but inefficient) code I have for the moment:
```python
import itertools
import numpy as np

tag_couples = []
for t in time_array:
    users_inter = np.nonzero(records_array == t)[0]  # Get the indices of all records in records_array with timestamp t
    l = [str(records[i].tagId) for i in users_inter]  # Create a temporary list containing all tagIds recorded at time t
    if l.count(l[0]) != len(l):  # Skip groups formed by a single tag repeated
        tag_couples += [x for x in itertools.combinations(list(set(l)), 2)]  # Remove duplicates with list(set(l)) and append all possible couple combinations to tag_couples
```
I'm quite sure this can be optimized with numpy, but I can't find a way to compare `records_array` with each element of `time_array` without using a for loop (they can't be compared simply with `==`, since they are both arrays).
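For the comparison itself, a minimal sketch of one possible vectorized approach (with made-up integer data standing in for the timestamp arrays described above) is to test membership with `np.isin` instead of looping:

```python
import numpy as np

# Hypothetical stand-ins for the timestamp arrays described above
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
time_array = np.array([1, 3])  # values known to be repeated

# Boolean mask marking every element of records_array that appears
# in time_array, computed without a Python-level loop
mask = np.isin(records_array, time_array)
repeated_indices = np.nonzero(mask)[0]
print(repeated_indices)  # [0 2 3 4 5 7]
```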
A vectorized solution with numpy, built on the magic of `unique()`.
```python
import numpy as np

# create a test array
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# creates an array of indices, sorted by unique element
idx_sort = np.argsort(records_array)

# sorts records array so all unique elements are together
sorted_records_array = records_array[idx_sort]

# returns the unique values, the index of the first occurrence of a value, and the count for each element
vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)

# splits the indices into separate arrays
res = np.split(idx_sort, idx_start[1:])

# filters them with respect to their size, keeping only items occurring more than once
vals = vals[count > 1]
res = list(filter(lambda x: x.size > 1, res))
```
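On the test array, the filtered values and their index groups line up one-to-one, so they can simply be zipped together (a small usage sketch):

```python
# Pair each repeated value with the indices where it occurs
for v, idx in zip(vals, res):
    print(v, '->', idx)
# 1 -> [0 3 4]
# 2 -> [1 8]
# 3 -> [2 5 7]
```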
The following code was the original answer, which required a bit more memory, using numpy broadcasting and calling `unique` twice:
```python
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# unique values, the mapping from each element back to its unique value, and the counts
vals, inverse, count = np.unique(records_array, return_inverse=True, return_counts=True)

# positions (within vals) of the values that occur more than once
idx_vals_repeated = np.where(count > 1)[0]
vals_repeated = vals[idx_vals_repeated]

# broadcast comparison: match every element against each repeated value at once
rows, cols = np.where(inverse == idx_vals_repeated[:, np.newaxis])
_, inverse_rows = np.unique(rows, return_index=True)
res = np.split(cols, inverse_rows[1:])
```
which gives, as expected, `res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]`.
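The extra memory mentioned above comes from the broadcast comparison: `inverse == idx_vals_repeated[:, np.newaxis]` materializes a boolean matrix with one row per repeated value and one column per element of `records_array`, which the argsort-based version never builds.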
Timing the alternatives against each other:

- `np.where` is faster than `defaultdict` for up to about 200 unique elements, but slower than `pandas.core.groupby.GroupBy.indices` and `np.unique`.
- `pandas` is the fastest solution for large arrays.
- `defaultdict` is a fast option for arrays up to about 2400 elements, especially with a large number of unique elements.
- Each implementation is timed below with `%timeit`.
```python
import random
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def dd(l):
    # defaultdict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices


def npw(l):
    # np.where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}


def uni(records_array):
    # np.unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))


def daf(l):
    # pandas groupby test
    return pd.DataFrame(l).groupby([0]).indices


data = defaultdict(list)

for x in range(4, 20000, 100):  # number of unique elements
    # create a 2M element array
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])

    res1 = %timeit -r2 -n1 -q -o dd(a)
    res2 = %timeit -r2 -n1 -q -o npw(a)
    res3 = %timeit -r2 -n1 -q -o uni(a)
    res4 = %timeit -r2 -n1 -q -o daf(a)

    data['default_dict'].append(res1.average)
    data['np_where'].append(res2.average)
    data['np_unique'].append(res3.average)
    data['pandas'].append(res4.average)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)

df.plot(figsize=(12, 5), xlabel='unique samples', ylabel='average time (s)',
        title='%timeit test: 2 runs 1 loop each')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
```
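Note that this benchmark uses the IPython `%timeit` magic, so it needs to run in an IPython shell or Jupyter notebook rather than as a plain Python script, and `matplotlib` is assumed for the final plot.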