
dealing with arrays: how to avoid a "for" statement

I have a 100000000x2 array named "a", with an index in the first column and a related value in the second column. I need to get the median of the values in the second column for each index. This is how I could do it with a for statement:

import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])
for i in range(1000000):
    # indices in column 0 start at 1, so b[i] holds the median for index i+1
    b[i] = np.median(a[a[:,0] == i + 1, 1])

Obviously it's too slow with the for iteration: any suggestions? Thanks

andreaconsole asked Sep 25 '12 20:09


2 Answers

This is known as a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for this:

import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])

# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])

# Form the groups.
grouped = df.groupby('index')

# `result` is the DataFrame containing the aggregated results.
result = grouped.median()
print(result)

Output:

       value
index       
1        3.0
2        5.5
5        8.5

There are ways to create the DataFrame containing the original data directly, so you don't necessarily have to create the numpy array a first.
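As a rough illustration of building the DataFrame directly, here is a minimal sketch that skips the intermediate numpy array. The sample values are made up for the example, and the column names 'index' and 'value' are simply reused from the code above.

```python
import pandas as pd

# Hypothetical sample data, built directly from Python lists
# instead of going through a numpy array first.
df = pd.DataFrame({
    "index": [1, 1, 2, 2, 2, 1],
    "value": [2.0, 3.0, 3.0, 4.0, 6.0, 4.0],
})

# Group by the index column and take the median of each group.
medians = df.groupby("index")["value"].median()
```

Here `medians` is a Series keyed by the index column, so `medians.loc[1]` gives the median for index 1.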

More information about the groupby operation in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html

Warren Weckesser answered Sep 29 '22 13:09

This is a little bit annoying to do, but at least you can easily get rid of that == comparison by sorting (the comparison is probably your speed killer). Trying more is probably not very useful, though it might be possible if you do the sorting yourself, etc.:

# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:,0]) # sort by class.
a = a[sorter] # sorted version of a

# Now we need to find where there are changes in the class:
w = np.where(a[:-1,0] != a[1:,0])[0] + 1 # Where the class changes.
# for simplicity, prepend [0] and append [len(a)] to have full slices...
w = np.concatenate(([0], w, [len(a)]))
# float result, since the median of integers can be a half-integer
result = np.zeros(len(w)-1, dtype=float)
for i in range(len(w)-1):
    # median of the values (second column) within one class
    result[i] = np.median(a[w[i]:w[i+1], 1])

# If the classes are not exactly 1, 2, ..., N we could add class information:
classes = a[w[:-1],0]

If all your classes are the same size (so there are exactly as many 1s as 2s, etc.), there are better ways, though.
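One possible way for the equal-size case, as a sketch only: sort by class, reshape the value column into one row per class, and take the median along each row. The function name `equal_size_group_medians` and the `group_size` parameter are made up for this example, and the code assumes every class really does occur exactly `group_size` times.

```python
import numpy as np

def equal_size_group_medians(a, group_size):
    """Medians per class, assuming each class has exactly group_size rows."""
    # Sort the rows by class so each class forms one contiguous block.
    order = np.argsort(a[:, 0], kind="stable")
    s = a[order]
    # Every block has the same length, so the first row of each block
    # gives the class label.
    classes = s[::group_size, 0]
    # Reshape the values to (n_classes, group_size) and reduce each row.
    medians = np.median(s[:, 1].reshape(-1, group_size), axis=1)
    return classes, medians

# Small made-up example: two classes with three values each.
example = np.array([[1, 2], [2, 3], [1, 4], [2, 7], [1, 9], [2, 5]])
classes_out, medians_out = equal_size_group_medians(example, 3)
```

The reshape raises an error if the number of rows is not a multiple of `group_size`, but it does not detect classes of unequal size, so that assumption has to be checked separately.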

EDIT: Check Bitwise's version for a solution that avoids the last for loop as well (he also hides some of this code in np.unique, which you may prefer, since speed should not matter for that part anyway).
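For reference, here is one way the remaining loop could be dropped. This is a sketch under my own assumptions and not necessarily what the referenced answer does: the rows are sorted by class and then by value, np.unique locates each class block, and the median is read off from the middle element(s) of each block. The name `group_medians` is made up for this example.

```python
import numpy as np

def group_medians(a):
    """Return (classes, medians) for a 2-column array of (class, value) rows."""
    # Sort by class first, then by value within each class
    # (np.lexsort treats the last key as the primary one).
    order = np.lexsort((a[:, 1], a[:, 0]))
    s = a[order]
    # On the sorted class column, np.unique gives each class, the position
    # where its block starts, and the number of rows in the block.
    classes, starts, counts = np.unique(
        s[:, 0], return_index=True, return_counts=True)
    # Values are sorted inside each block, so the median is the mean of
    # the two middle elements (the same element when the count is odd).
    lo = starts + (counts - 1) // 2
    hi = starts + counts // 2
    medians = (s[lo, 1] + s[hi, 1]) / 2.0
    return classes, medians

# Same sample data as in the pandas answer above.
sample = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
                   [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
                   [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])
sample_classes, sample_medians = group_medians(sample)
```

The cost is dominated by the sort, so this runs in O(n log n) with no Python-level loop over the classes.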

seberg answered Sep 29 '22 11:09