This question is about filtering a NumPy ndarray according to some column values.
I have a fairly large NumPy ndarray, shaped (300000, 50), and I am filtering it according to values in some specific columns. I have named dtypes, so I can access each column by name.
The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ('A', 'B', 'C').
The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.
Here is what I do now:
index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]
A list comprehension like:
rows = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(rows)
wouldn't work because the dtypes I originally had are no longer accessible.
Is there any better / more Pythonic way of achieving the same result?
Something that could look like:
filtered_data = data.where({'category_code': ('A', 'B', 'C')})
Thanks!
In NumPy, you filter an array with a boolean mask: an array of boolean values, one per element along the indexed axis. Where the mask is True, the corresponding element is kept in the filtered array; where it is False, the element is excluded. This technique is called boolean mask indexing.
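This boolean-mask approach works directly on the structured array from the question. A minimal sketch (with made-up sample rows) that also uses numpy.isin to build the mask in one vectorized call instead of a per-row list comprehension:

```python
import numpy as np

# Small structured array mimicking the question's data
# (the sample rows are invented for illustration).
data = np.array(
    [('D', 4), ('A', 2), ('B', 6), ('C', 3), ('A', 9)],
    dtype=[('category_code', 'U1'), ('value', 'i4')],
)

# Vectorized membership test: one boolean per row.
mask = np.isin(data['category_code'], ['A', 'B', 'C'])

# Boolean indexing preserves the dtype, so columns stay accessible by name.
filtered_data = data[mask]
print(filtered_data['value'])
```

Because the mask is computed in C rather than in a Python loop, this also tends to be noticeably faster on a 300000-row array.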
You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:
>>> # import the library
>>> import pandas as pd
Create some sample data as a Python dictionary whose keys are the column names and whose values are the column values as Python lists, one key/value pair per column:
>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
...         'value': [4, 2, 6, 3, 8, 4, 3, 9]}
>>> # convert to a Pandas DataFrame
>>> D = pd.DataFrame(data)
To return just the rows in which the category_code is either B or C takes two conceptual steps, though they can easily be combined into a single line:
>>> # step 1: create the index
>>> idx = (D.category_code == 'B') | (D.category_code == 'C')
>>> # then filter the data against that index:
>>> D.loc[idx]
category_code value
2 B 6
3 C 3
6 C 3
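The same filter can indeed be written in a single line. A minimal sketch, assuming the conventional pd alias, that uses Series.isin to test against the whole set of wanted categories at once:

```python
import pandas as pd

# Same sample data as above.
data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
        'value': [4, 2, 6, 3, 8, 4, 3, 9]}
D = pd.DataFrame(data)

# Build the boolean mask and filter in one expression.
filtered = D.loc[D['category_code'].isin(['B', 'C'])]
print(filtered)
```

isin scales more gracefully than chaining `|` comparisons when the set of categories grows.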
Note the difference between indexing in Pandas and in NumPy, the library on which Pandas is built. With a plain two-dimensional NumPy array, you would place the boolean index directly inside the brackets, separating the dimensions with a "," and using ":" to indicate that you want all of the values (columns) in the other dimension:
>>> arr[idx, :]
In Pandas, you use the DataFrame's loc indexer and place only the index inside the brackets:
>>> D.loc[idx]
If you can choose, I strongly recommend pandas: it has "column indexing" built-in plus a lot of other features. It is built on numpy.
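If the result must end up as a NumPy array with named fields again, as the question requires, a hedged sketch: DataFrame.to_records converts the filtered frame back into a NumPy record array whose columns remain accessible by name (the sample data below is invented):

```python
import pandas as pd

data = {'category_code': ['D', 'A', 'B', 'C'], 'value': [4, 2, 6, 3]}
df = pd.DataFrame(data)

# Filter with isin, then convert back to a NumPy record array.
filtered = df[df['category_code'].isin(['A', 'B', 'C'])]
rec = filtered.to_records(index=False)

print(rec['category_code'])  # field access by name still works
```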