I have a 1000000x2 array named "a", with an index in the first column and a related value in the second column. I need to get the median of the values in the second column for each index. This is how I could do it with a for loop:
import numpy as np

a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000, 6]])
b = np.zeros(1000000)
# Indices run from 1 to 1000000, so offset into b by one.
for i in xrange(1, 1000001):
    b[i - 1] = np.median(a[np.where(a[:, 0] == i), 1])
Obviously this is far too slow with the for loop: any suggestions? Thanks
This is known as a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for this:
import numpy as np
import pandas as pd
a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])
# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])
# Form the groups.
grouped = df.groupby('index')
# `result` is the DataFrame containing the aggregated results.
result = grouped.aggregate(np.median)
print result
Output:
       value
index
1        3.0
2        5.5
5        8.5
There are ways to create the DataFrame containing the original data directly, so you don't necessarily have to create the numpy array a first.
More information about the groupby operation in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html
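For reference, in more recent pandas versions the same group-by median can be written more compactly by selecting the value column before aggregating (a sketch, assuming the same DataFrame layout as above):

```python
import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
              [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
              [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])

df = pd.DataFrame(a, columns=['index', 'value'])
# groupby(...)['value'].median() returns a Series keyed by group label.
medians = df.groupby('index')['value'].median()
```

The result is a Series rather than a DataFrame, which is often more convenient to index into directly.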
This is a little bit annoying to do, but at least you can easily get rid of that annoying == comparison by sorting (and that comparison is probably your speed killer). Trying to squeeze out more is probably not very useful, though it might be possible if you sort yourself, etc.:
# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:, 0])  # sort by class
a = a[sorter]                 # sorted version of a
# Now we need to find where the class changes:
w = np.where(a[:-1, 0] != a[1:, 0])[0] + 1  # where the class changes
# For simplicity, prepend 0 and append len(a) to get full slices:
w = np.concatenate(([0], w, [len(a)]))
result = np.zeros(len(w) - 1)  # one median per class
for i in xrange(len(w) - 1):
    result[i] = np.median(a[w[i]:w[i + 1], 1])
# If the classes are not exactly 1, 2, ..., N we can also keep the labels:
classes = a[w[:-1], 0]
If all your classes are the same size, so that there are exactly as many 1s as 2s, etc., there are even better ways.
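For the equal-sized case, one such way is to reshape the sorted values so that each row is one class and take the median along an axis, avoiding the Python loop entirely. A minimal sketch of that idea (the variable names and the group size k are illustrative, not from the answer above):

```python
import numpy as np

# Example with 3 classes, each appearing exactly k = 2 times.
a = np.array([[2, 6.0], [1, 2.0], [3, 9.0],
              [1, 4.0], [3, 7.0], [2, 1.0]])
k = 2

a = a[np.argsort(a[:, 0], kind='stable')]  # sort rows by class
values = a[:, 1].reshape(-1, k)            # one row per class
medians = np.median(values, axis=1)        # median of each class
classes = a[::k, 0]                        # class label of each row
```

This is a single vectorized np.median call, so it scales well when the number of classes is large.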
EDIT: Check Bitwise's version for a solution that avoids the last for loop as well (he also hides some of this code inside np.unique, which you may prefer, since speed should not matter much for that part anyway).
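A sketch of what such an np.unique-based variant could look like (the exact code in Bitwise's answer may differ; return_index gives the first occurrence of each class in the sorted array, which replaces the hand-rolled change detection):

```python
import numpy as np

a = np.array([[1, 2.0], [1, 3.0], [2, 3.0],
              [2, 4.0], [2, 6.0], [1, 4.0]])

a = a[np.argsort(a[:, 0], kind='stable')]  # sort rows by class
classes, starts = np.unique(a[:, 0], return_index=True)
# Append len(a) so every slice has an end point.
bounds = np.append(starts, len(a))
medians = np.array([np.median(a[bounds[i]:bounds[i + 1], 1])
                    for i in range(len(classes))])
```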