I have a 1000000x2 array named "a", with an index in the first column and a related value in the second column. I need to get the median of the values in the second column for each index. This is how I could do it with a for loop:
import numpy as np

a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000, 6]])
b = np.zeros(1000000)
# Indices run from 1 to 1000000, so offset into b by one.
for i in xrange(1, 1000001):
    b[i - 1] = np.median(a[np.where(a[:, 0] == i), 1])
Obviously this is far too slow with the for loop: any suggestions? Thanks
This is known as a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for this:
import numpy as np
import pandas as pd
a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])
# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])
# Form the groups.
grouped = df.groupby('index')
# `result` is the DataFrame containing the aggregated results.
result = grouped.aggregate(np.median)
print result
Output:
       value
index
1        3.0
2        5.5
5        8.5
There are ways to create the DataFrame containing the original data directly, so you don't necessarily have to create the numpy array a first.
More information about the groupby operation in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html
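For reference, in more recent pandas versions the same group-by median can be written more compactly by selecting the value column before aggregating (a sketch, assuming the same DataFrame layout as above):

```python
import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
              [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
              [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])

df = pd.DataFrame(a, columns=['index', 'value'])
# groupby(...)['value'].median() returns a Series keyed by group label.
medians = df.groupby('index')['value'].median()
```

The result is a Series rather than a DataFrame, which is often more convenient to index into directly.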
This is a little bit annoying to do, but at least you can easily get rid of that annoying == comparison by sorting (and that comparison is probably your speed killer). Trying to squeeze out more is probably not very useful, though it might be possible if you sort yourself, etc.:
# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:, 0])  # sort by class
a = a[sorter]                 # sorted version of a
# Now we need to find where the class changes:
w = np.where(a[:-1, 0] != a[1:, 0])[0] + 1  # where the class changes
# For simplicity, prepend 0 and append len(a) to get full slices:
w = np.concatenate(([0], w, [len(a)]))
result = np.zeros(len(w) - 1)  # one median per class
for i in xrange(len(w) - 1):
    result[i] = np.median(a[w[i]:w[i + 1], 1])
# If the classes are not exactly 1, 2, ..., N we can also keep the labels:
classes = a[w[:-1], 0]
If all your classes are the same size, so that there are exactly as many 1s as 2s, etc., there are even better ways.
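For the equal-sized case, one such way is to reshape the sorted values so that each row is one class and take the median along an axis, avoiding the Python loop entirely. A minimal sketch of that idea (the variable names and the group size k are illustrative, not from the answer above):

```python
import numpy as np

# Example with 3 classes, each appearing exactly k = 2 times.
a = np.array([[2, 6.0], [1, 2.0], [3, 9.0],
              [1, 4.0], [3, 7.0], [2, 1.0]])
k = 2

a = a[np.argsort(a[:, 0], kind='stable')]  # sort rows by class
values = a[:, 1].reshape(-1, k)            # one row per class
medians = np.median(values, axis=1)        # median of each class
classes = a[::k, 0]                        # class label of each row
```

This is a single vectorized np.median call, so it scales well when the number of classes is large.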
EDIT: Check Bitwise's version for a solution that avoids the last for loop as well (he also hides some of this code inside np.unique, which you may prefer, since speed should not matter much for that part anyway).
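A sketch of what such an np.unique-based variant could look like (the exact code in Bitwise's answer may differ; return_index gives the first occurrence of each class in the sorted array, which replaces the hand-rolled change detection):

```python
import numpy as np

a = np.array([[1, 2.0], [1, 3.0], [2, 3.0],
              [2, 4.0], [2, 6.0], [1, 4.0]])

a = a[np.argsort(a[:, 0], kind='stable')]  # sort rows by class
classes, starts = np.unique(a[:, 0], return_index=True)
# Append len(a) so every slice has an end point.
bounds = np.append(starts, len(a))
medians = np.array([np.median(a[bounds[i]:bounds[i + 1], 1])
                    for i in range(len(classes))])
```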