Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get a list of all indices of repeated elements in a numpy array

Tags:

I'm trying to get the index of all repeated elements in a numpy array, but the solution I found for the moment is REALLY inefficient for a large (>20000 elements) input array (it takes more or less 9 seconds). The idea is simple:

  1. records_arrayis a numpy array of timestamps (datetime) from which we want to extract the indexes of repeated timestamps

  2. time_array is a numpy array containing all the timestamps that are repeated in records_array

  3. records is a django QuerySet (which can easily converted to a list) containing some Record objects. We want to create a list of couples formed by all possible combinations of tagId attributes of Record corresponding to the repeated timestamps found from records_array.

Here is the working (but inefficient) code I have for the moment:

tag_couples = []; for t in time_array:     users_inter = np.nonzero(records_array == t)[0] # Get all repeated timestamps in records_array for time t     l = [str(records[i].tagId) for i in users_inter] # Create a temporary list containing all tagIds recorded at time t     if l.count(l[0]) != len(l): #remove tuples formed by the first tag repeated         tag_couples +=[x for x in itertools.combinations(list(set(l)),2)] # Remove duplicates with list(set(l)) and append all possible couple combinations to tag_couples 

I'm quite sure this can be optimized by using Numpy, but I can't find a way to compare records_array with each element of time_array without using a for loop (this can't be compared by just using ==, since they are both arrays).

like image 371
morens Avatar asked May 02 '15 13:05

morens


People also ask

How do you find the index of a repeating element in an array?

int[] arr = {3,5,6,7,2,3,11,14 }; int index = Array. IndexOf(arr, 3); Console. WriteLine(index); Console. ReadLine();

How do I get unique values from an NP array?

The unique() function is used to find the unique elements of an array. Returns the sorted unique elements of an array. There are three optional outputs in addition to the unique elements: the indices of the input array that give the unique values.

How do you get the product of all of the elements in a NumPy array?

prod() in Python. numpy. prod() returns the product of array elements over a given axis.


2 Answers

A vectorized solution with numpy, on the magic of unique().

import numpy as np  # create a test array records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])  # creates an array of indices, sorted by unique element idx_sort = np.argsort(records_array)  # sorts records array so all unique elements are together  sorted_records_array = records_array[idx_sort]  # returns the unique values, the index of the first occurrence of a value, and the count for each element vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)  # splits the indices into separate arrays res = np.split(idx_sort, idx_start[1:])  #filter them with respect to their size, keeping only items occurring more than once vals = vals[count > 1] res = filter(lambda x: x.size > 1, res) 

The following code was the original answer, which required a bit more memory, using numpy broadcasting and calling unique twice:

records_array = array([1, 2, 3, 1, 1, 3, 4, 3, 2]) vals, inverse, count = unique(records_array, return_inverse=True,                               return_counts=True)  idx_vals_repeated = where(count > 1)[0] vals_repeated = vals[idx_vals_repeated]  rows, cols = where(inverse == idx_vals_repeated[:, newaxis]) _, inverse_rows = unique(rows, return_index=True) res = split(cols, inverse_rows[1:]) 

with as expected res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]

like image 83
gg349 Avatar answered Oct 06 '22 23:10

gg349


  • The answer is complicated, and dependent upon the size, and number of unique elements in the array.
  • The following:
    • Tests arrays with 2M elements, and up to 20k unique elements.
    • Tests arrays up to 80k elements, with a max of 20k unique elements
      • For arrays under 40k elements, the tests have up to half the unique elements as the size of the array (e.g. 10k elements would have up to 5k unique elements).

Arrays with 2M Elements

  • np.where is faster than defaultdict for up to about 200 unique elements, but slower than pandas.core.groupby.GroupBy.indices, and np.unique.
  • The solution using pandas, is the fastest solution for large arrays.

Arrays with up to 80k Elements

  • This is more situational, depending on the size of the array and the number of unique elements.
  • defaultdict is a fast option for arrays to about 2400 elements, especially with a large number of unique elements.
  • For arrays larger than 40k elements, and 20k unique elements, pandas is the fastest option.

%timeit

import random import numpy import pandas as pd from collections import defaultdict  def dd(l):     # default_dict test     indices = defaultdict(list)     for i, v in enumerate(l):         indices[v].append(i)     return indices   def npw(l):     # np_where test     return {v: np.where(l == v)[0] for v in np.unique(l)}   def uni(records_array):     # np_unique test     idx_sort = np.argsort(records_array)     sorted_records_array = records_array[idx_sort]     vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)     res = np.split(idx_sort, idx_start[1:])     return dict(zip(vals, res))   def daf(l):     # pandas test     return pd.DataFrame(l).groupby([0]).indices   data = defaultdict(list)  for x in range(4, 20000, 100):  # number of unique elements     # create 2M element list     random.seed(365)     a = np.array([random.choice(range(x)) for _ in range(2000000)])          res1 = %timeit -r2 -n1 -q -o dd(a)     res2 = %timeit -r2 -n1 -q -o npw(a)     res3 = %timeit -r2 -n1 -q -o uni(a)     res4 = %timeit -r2 -n1 -q -o daf(a)          data['defaut_dict'].append(res1.average)     data['np_where'].append(res2.average)     data['np_unique'].append(res3.average)     data['pandas'].append(res4.average)     data['idx'].append(x)  df = pd.DataFrame(data) df.set_index('idx', inplace=True)  df.plot(figsize=(12, 5), xlabel='unique samples', ylabel='average time (s)', title='%timeit test: 2 run 1 loop each') plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left') plt.show() 

Tests with 2M elements

enter image description here

enter image description here

enter image description here

enter image description here

Tests with up to 80k elements

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

enter image description here

like image 33
Trenton McKinney Avatar answered Oct 06 '22 22:10

Trenton McKinney