How to get away with a multidimensional index in pandas

Tags: python, pandas, multi-index

In Pandas, what is a good way to select sets of arbitrary rows in a multiindex?

df = pd.DataFrame(columns=['A', 'B', 'C'])
df['A'] = ['a', 'a', 'b', 'b']
df['B'] = [1,2,3,4]
df['C'] = [1,2,3,4]

the_indices_we_want = df.ix[[0,3],['A','B']]
df = df.set_index(['A', 'B']) #Create a multiindex

df.ix[the_indices_we_want] #ValueError: Cannot index with multidimensional key

df.ix[[tuple(x) for x in the_indices_we_want.values]]

This last line works, but it feels clunky: the keys can't even be lists, they have to be tuples. It also involves generating a new object to do the indexing with. I'm in a situation where I'm trying to do a lookup on a multiindex dataframe, with indices from another dataframe:

data_we_want = dataframe_with_the_data.ix[dataframe_with_the_indices[['Index1','Index2']]]

Right now it looks like I need to write it like this:

data_we_want = dataframe_with_the_data.ix[[tuple(x) for x in dataframe_with_the_indices[['Index1','Index2']].values]]

That is workable, but if there are many rows (e.g. hundreds of millions of desired indices), then generating this list of tuples becomes quite a burden. Any solutions?
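One way to avoid building the list of tuples is to treat the lookup as a join: the key columns of one frame are matched against the MultiIndex of the other. The sketch below is only an illustration under assumptions. The frame contents and the `value` column are made up, and whether this scales to hundreds of millions of keys has not been measured here.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames named in the question.
dataframe_with_the_data = pd.DataFrame({
    'Index1': ['a', 'a', 'b', 'b'],
    'Index2': [1, 2, 3, 4],
    'value': [10, 20, 30, 40],   # made-up payload column
}).set_index(['Index1', 'Index2'])

dataframe_with_the_indices = pd.DataFrame({
    'Index1': ['a', 'b'],
    'Index2': [1, 4],
})

# Match the key columns against the MultiIndex levels.
# No intermediate list of tuples is created in user code.
data_we_want = dataframe_with_the_indices[['Index1', 'Index2']].join(
    dataframe_with_the_data, on=['Index1', 'Index2']
)
print(data_we_want)
```

Note that the result keeps `Index1` and `Index2` as ordinary columns and uses the row labels of `dataframe_with_the_indices`; call `set_index(['Index1', 'Index2'])` on it if a MultiIndex is needed again.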

Edit: The solution by @joris works, but not if the indices are all numbers. Example where the indices are all integers:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df['A'] = ['a', 'a', 'b', 'b']
df['B'] = [1,2,3,4]
df['C'] = [1,2,3,4]

the_indices_we_want = df.ix[[0,3],['B','C']]
df = df.set_index(['B', 'C'])

df.ix[pd.Index(the_indices_we_want)] #ValueError: Cannot index with multidimensional key

df.ix[pd.Index(the_indices_we_want.astype('object'))] #Works, though feels clunky.
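For readers on current pandas, where `.ix` no longer exists (it was removed in pandas 1.0), a possible equivalent of the all-integer example is to build the keys with `pd.MultiIndex.from_frame` and select with `.loc`. This is a sketch rather than a tested drop-in for the original code; the names `indexed`, `keys` and `result` are introduced here for illustration. No `astype('object')` step is involved.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'a', 'b', 'b'],
    'B': [1, 2, 3, 4],
    'C': [1, 2, 3, 4],
})

# Rows 0 and 3 of the key columns, both of integer dtype.
the_indices_we_want = df.loc[[0, 3], ['B', 'C']]
indexed = df.set_index(['B', 'C'])

# Each row of the frame becomes one MultiIndex entry.
keys = pd.MultiIndex.from_frame(the_indices_we_want)
result = indexed.loc[keys]
print(result)
```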


asked Mar 10 '15 11:03

jeffalstott

2 Answers

You indeed cannot index with a DataFrame directly, but if you convert it to an Index object, it does the correct thing (a row in the DataFrame will be regarded as one multi-index entry):

In [43]: pd.Index(the_indices_we_want)
Out[43]: Index([(u'a', 1), (u'b', 4)], dtype='object')

In [44]: df.ix[pd.Index(the_indices_we_want)]
Out[44]:
     C
A B
a 1  1
b 4  4

In [45]: df.ix[[tuple(x) for x in the_indices_we_want.values]]
Out[45]:
     C
A B
a 1  1
b 4  4

This is somewhat cleaner. In some quick tests it also seems to be a bit faster (though not by much, only about 2 times).
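A rough way to repeat such a timing comparison on current pandas is sketched below. It is not the test that was run for this answer: the data is synthetic, the sizes are arbitrary, `.loc` replaces the removed `.ix`, and `pd.MultiIndex.from_frame` stands in for the `pd.Index(...)` conversion. The numbers will vary by machine and pandas version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data with a unique two-level integer MultiIndex (sizes are arbitrary).
n = 10_000
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'A': np.arange(n) % 100,
    'B': np.arange(n),
    'C': rng.random(n),
}).set_index(['A', 'B'])

# Pick 1,000 existing index entries and hold them as a plain DataFrame.
wanted = data.index.to_frame(index=False).sample(1_000, random_state=0)

def via_tuples():
    # The list-of-tuples approach from the question.
    return data.loc[[tuple(x) for x in wanted.values]]

def via_multiindex():
    # Convert the frame to a MultiIndex and index with it directly.
    return data.loc[pd.MultiIndex.from_frame(wanted)]

t_tuples = timeit.timeit(via_tuples, number=5)
t_multi = timeit.timeit(via_multiindex, number=5)
print(f"tuples: {t_tuples:.4f}s  MultiIndex: {t_multi:.4f}s")
```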

answered Nov 15 '22 15:11

joris

In newer versions of pandas you can simply use .iloc for row indexing.

df = pd.DataFrame(columns=['A', 'B', 'C'])
df['A'] = ['a', 'a', 'b', 'b']
df['B'] = [1,2,3,4]
df['C'] = [1,2,3,4]
df.iloc[[0, 3]][['A', 'B']]

answered Nov 15 '22 14:11

bjonen
