This question is about filtering a NumPy ndarray according to some column values.
I have a fairly large NumPy ndarray, shaped (300000, 50), and I am filtering it according to values in some specific columns. I have named dtypes, so I can access each column by name.
The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ('A', 'B', 'C').
The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.
Here is what I do now:
index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]
A list comprehension like:
rows = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(rows)
wouldn't work because the dtypes I originally had are no longer accessible.
Is there any better / more Pythonic way of achieving the same result?
Something that could look like:
filtered_data = data.where({'category_code': ('A', 'B', 'C')})
Thanks!
In NumPy, you filter an array with a boolean mask: an array of boolean values, one per element along the indexed axis. Where the mask is True, the corresponding element is kept in the filtered array; where it is False, the element is excluded. This technique is called boolean mask indexing.
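This boolean-mask approach works directly on the structured array from the question. A minimal sketch (with made-up sample rows) that also uses numpy.isin to build the mask in one vectorized call instead of a per-row list comprehension:

```python
import numpy as np

# Small structured array mimicking the question's data
# (the sample rows are invented for illustration).
data = np.array(
    [('D', 4), ('A', 2), ('B', 6), ('C', 3), ('A', 9)],
    dtype=[('category_code', 'U1'), ('value', 'i4')],
)

# Vectorized membership test: one boolean per row.
mask = np.isin(data['category_code'], ['A', 'B', 'C'])

# Boolean indexing preserves the dtype, so columns stay accessible by name.
filtered_data = data[mask]
print(filtered_data['value'])
```

Because the mask is computed in C rather than in a Python loop, this also tends to be noticeably faster on a 300000-row array.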
You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:
>>> # import the library
>>> import pandas as pd
Create some sample data as a Python dictionary whose keys are the column names and whose values are the column values as Python lists, one key/value pair per column:
>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
...         'value': [4, 2, 6, 3, 8, 4, 3, 9]}
>>> # convert to a Pandas DataFrame
>>> D = pd.DataFrame(data)
To return just the rows in which the category_code is either B or C takes two conceptual steps, though they can easily be combined into a single line:
>>> # step 1: create the index
>>> idx = (D.category_code == 'B') | (D.category_code == 'C')
>>> # then filter the data against that index:
>>> D.loc[idx]
category_code value
2 B 6
3 C 3
6 C 3
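The same filter can indeed be written in a single line. A minimal sketch, assuming the conventional pd alias, that uses Series.isin to test against the whole set of wanted categories at once:

```python
import pandas as pd

# Same sample data as above.
data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
        'value': [4, 2, 6, 3, 8, 4, 3, 9]}
D = pd.DataFrame(data)

# Build the boolean mask and filter in one expression.
filtered = D.loc[D['category_code'].isin(['B', 'C'])]
print(filtered)
```

isin scales more gracefully than chaining `|` comparisons when the set of categories grows.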
Note the difference between indexing in Pandas and in NumPy, the library on which Pandas is built. With a plain two-dimensional NumPy array, you would place the boolean index directly inside the brackets, separating the dimensions with a "," and using ":" to indicate that you want all of the values (columns) in the other dimension:
>>> arr[idx, :]
In Pandas, you use the DataFrame's loc indexer and place only the index inside the brackets:
>>> D.loc[idx]
If you can choose, I strongly recommend pandas: it has "column indexing" built-in plus a lot of other features. It is built on numpy.
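If the result must end up as a NumPy array with named fields again, as the question requires, a hedged sketch: DataFrame.to_records converts the filtered frame back into a NumPy record array whose columns remain accessible by name (the sample data below is invented):

```python
import pandas as pd

data = {'category_code': ['D', 'A', 'B', 'C'], 'value': [4, 2, 6, 3]}
df = pd.DataFrame(data)

# Filter with isin, then convert back to a NumPy record array.
filtered = df[df['category_code'].isin(['A', 'B', 'C'])]
rec = filtered.to_records(index=False)

print(rec['category_code'])  # field access by name still works
```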