
Filter numpy ndarray (matrix) according to column values

This question is about filtering a NumPy ndarray according to some column values.

I have a fairly large NumPy ndarray (300000, 50) and I am filtering it according to values in some specific columns. I have ndtypes so I can access each column by name.

The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ("A", "B", "C").

The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.

Here is what I do now:

index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]

List comprehension like:

rows = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(rows)

wouldn't work because the dtypes I originally had are no longer accessible.

Is there a better / more Pythonic way of achieving the same result?

Something that could look like:

filtered_data = data.where({'category_code': ('A', 'B', 'C')})

Thanks!

Nicolas M. asked Aug 23 '12

People also ask

How can we filter values in NumPy arrays?

In NumPy, you filter an array using a boolean index list: a list of booleans corresponding to the indexes in the array. If the value at an index is True, that element is included in the filtered array; if it is False, that element is excluded.

How do you filter Ndarray?

You can filter a numpy array by creating a list or an array of boolean values indicative of whether or not to keep the element in the corresponding array. This method is called boolean mask slicing.
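Applied to the asker's structured array, boolean mask slicing solves the original question in pure NumPy: `np.isin` (or `np.in1d` on NumPy older than 1.13) builds the membership mask in one vectorized call, and slicing with it preserves the dtype names. A minimal sketch with made-up sample data:

```python
import numpy as np

# A small structured array mirroring the question: a named 'category_code'
# column plus one value column (the data here is invented for illustration).
dt = np.dtype([('category_code', 'U1'), ('value', 'i8')])
data = np.array([('D', 4), ('A', 2), ('B', 6), ('C', 3)], dtype=dt)

# Build the boolean mask in one vectorized call instead of a Python loop.
mask = np.isin(data['category_code'], ['A', 'B', 'C'])

# Boolean mask slicing returns another structured ndarray, so the
# columns remain accessible by name, as the asker requires.
filtered = data[mask]
print(filtered['category_code'])  # ['A' 'B' 'C']
```

This is a drop-in replacement for the list-comprehension mask in the question, and it avoids the per-row Python loop entirely.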


2 Answers

You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:

>>> # import the library
>>> import pandas as PD

Create some sample data as a Python dictionary, whose keys are the column names and whose values are the column values as Python lists, one key/value pair per column:

>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'], 
            'value':[4, 2, 6, 3, 8, 4, 3, 9]}

>>> # convert to a Pandas 'DataFrame'
>>> D = PD.DataFrame(data)

To return just the rows in which the category_code is either B or C: conceptually this takes two steps, but it can easily be done in a single line:

>>> # step 1: create the index 
>>> idx = (D.category_code == 'B') | (D.category_code == 'C')

>>> # then filter the data against that index:
>>> D.loc[idx]

  category_code  value
2             B      6
3             C      3
6             C      3
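For longer membership lists, the chained `|` comparisons above can be replaced with `Series.isin`, which builds the same boolean index in a single call. A sketch using the same sample data:

```python
import pandas as pd

# Same sample data as above.
data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
        'value': [4, 2, 6, 3, 8, 4, 3, 9]}
D = pd.DataFrame(data)

# isin tests membership against any number of allowed values at once.
idx = D['category_code'].isin(['B', 'C'])
filtered = D[idx]
```

This scales to the asker's three-value filter (or more) without the expression growing with every added category.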

Note the difference between indexing in Pandas versus NumPy, the library upon which Pandas is built. In NumPy, you would place the index directly inside the brackets, separating the dimensions with a "," and using ":" to indicate that you want all of the values (columns) along the other dimension:

>>> data[idx, :]

In Pandas, you use the data frame's loc indexer and place only the index inside the brackets:

>>> D.loc[idx]
doug answered Sep 24 '22


If you can choose, I strongly recommend pandas: it has "column indexing" built-in plus a lot of other features. It is built on numpy.
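Since the asker already has a structured ndarray, the switch is cheap: `pd.DataFrame` accepts a structured array directly, turning the dtype names into columns, and `to_records` goes back the other way. A hedged sketch (the sample array is invented for illustration):

```python
import numpy as np
import pandas as pd

# A structured ndarray like the asker's, with columns named via the dtype.
dt = np.dtype([('category_code', 'U1'), ('value', 'i8')])
arr = np.array([('D', 4), ('A', 2), ('C', 3)], dtype=dt)

# The dtype field names become DataFrame column names automatically.
df = pd.DataFrame(arr)
subset = df[df['category_code'].isin(['A', 'C'])]

# Round-trip back to a structured array (a numpy recarray) if needed.
back = subset.to_records(index=False)
```

So adopting pandas for the filtering step doesn't lock you out of the NumPy representation the rest of the code may expect.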

lbolla answered Sep 25 '22