Code example:
In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])

In [172]: B = np.array([111, 222, 222, 333, 333, 777])

In [173]: C = randint(10, 99, 6)

In [174]: df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])

In [175]: df.set_index(['A', 'B'], inplace=True)

In [176]: df
Out[176]:
          C
A   B
1.1 111  20
    222  31
3.3 222  24
    333  65
5.5 333  22
6.6 777  74
Now, I want to retrieve A values:
Q1: in the range [3.3, 6.6] - expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] if the upper bound is inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if it is not.
Q2: in the range [2.0, 4.0] - expected return value: [3.3] or [3.3, 3.3]
Same for any other MultiIndex dimension, for example B values:
Q3: in the range [111, 500], with repetitions, i.e. one value per data row in the range - expected return value: [111, 222, 222, 333, 333] (a sketch illustrating Q1-Q3 follows below).
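A minimal, self-contained sketch of what Q1-Q3 ask for; the C values are copied from the transcript above instead of randint, and the choice of inclusive bounds is only an assumption:

import pandas as pd

# Same frame as above; C is taken from the transcript instead of randint.
df = pd.DataFrame({'A': [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
                   'B': [111, 222, 222, 333, 333, 777],
                   'C': [20, 31, 24, 65, 22, 74]}).set_index(['A', 'B'])

# Q1: values of level 'A' within [3.3, 6.6] (upper bound inclusive here).
# Q2 is the same pattern with bounds 2.0 and 4.0.
a_vals = df.index.get_level_values('A')
q1 = a_vals[(a_vals >= 3.3) & (a_vals <= 6.6)]
print(list(q1))           # [3.3, 3.3, 5.5, 6.6] -- with repetitions
print(list(q1.unique()))  # [3.3, 5.5, 6.6]      -- without repetitions

# Q3: values of level 'B' within [111, 500], one value per matching data row.
b_vals = df.index.get_level_values('B')
print(list(b_vals[(b_vals >= 111) & (b_vals <= 500)]))  # [111, 222, 222, 333, 333]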
More formally:
Let us assume T is a table with columns A, B and C, containing n rows. The table cells are numbers, for example A holds doubles while B and C hold integers. Let's create a DataFrame from table T, name it DF, and set columns A and B as the index of DF (without duplication, i.e. A and B exist only in the index, not also as data columns), so that A and B form a MultiIndex.
Questions: the same Q1-Q3 as above, applied to the index levels of DF.
I know the answers to the above questions for columns which are not indexes, but in the index case, after long research on the web and experimentation with the functionality of pandas, I did not succeed. The only method I see now (without additional programming) is to keep duplicates of A and B as data columns in addition to the index, as sketched below.
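For completeness, a minimal sketch of that fallback; the A_val/B_val column names are only illustrative, and df is the MultiIndex frame built above:

# df: the MultiIndex (A, B) DataFrame built in the question.
df_dup = df.copy()
df_dup['A_val'] = df_dup.index.get_level_values('A')
df_dup['B_val'] = df_dup.index.get_level_values('B')

# Ordinary column-based range selection works again, e.g. Q2: A in [2.0, 4.0].
print(df_dup.loc[(df_dup.A_val >= 2.0) & (df_dup.A_val <= 4.0), 'A_val'].tolist())  # [3.3, 3.3]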
As background: pandas DataFrame.reset_index() converts/transfers a MultiIndex (multi-level index) into regular columns. With the default drop=False, the index values are kept as columns and the DataFrame gets a new default integer index starting from zero.
Rows and columns can then be selected by name or position with [ ], loc and iloc, i.e. ordinary pandas indexing, as the sketch below shows.
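A minimal sketch of that reset_index() route, assuming the df constructed in the question above:

# Move the index levels back into ordinary columns (drop=False is the default).
flat = df.reset_index()
print(flat.columns.tolist())                             # ['A', 'B', 'C']
print(flat.index.tolist())                               # [0, 1, 2, 3, 4, 5]

# Plain column selection then answers e.g. Q3: B in [111, 500].
print(flat.loc[flat.B.between(111, 500), 'B'].tolist())  # [111, 222, 222, 333, 333]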
To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):
In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]

In [537]: result_df
Out[537]:
          C
A   B
3.3 222  43
    333  59
5.5 333  56
Hence, to get, for example, the 'A' index values, if they are still required:
In [538]: result_df.index.get_level_values('A')
Out[538]: Index([3.3, 3.3, 5.5], dtype=object)
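If only the filter expression should read more naturally, an alternative (not part of the original answer) is DataFrame.query(), which accepts MultiIndex level names in its expression; a small sketch assuming the same df and bounds:

# query() may refer to the MultiIndex level names directly.
result_df = df.query('A > 1.7 and B < 666')
print(result_df.index.get_level_values('A').tolist())  # [3.3, 3.3, 5.5]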
The problem is that on large data frames the performance of selection by index is about 10% worse than regular row selection on sorted columns, and in repetitive work (looping) the delay accumulates. See the example:
In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)

In [559]: len(df)
Out[559]: 12857

In [560]: df.sort(inplace=True)

In [561]: df_without_index = df.reset_index()

In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]
1000 loops, best of 3: 562 µs per loop

In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]
1000 loops, best of 3: 507 µs per loop
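The store and STORE_EXTENT_BURSTS_DF_KEY above are not reproducible here, so the following is only a sketch of how such a comparison might be rerun on synthetic data; the row count and the START_TIME/END_TIME column names mirror the transcript, but the values are made up, so absolute timings will differ:

import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the HDFStore data in the transcript.
n = 12857
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'START_TIME': rng.integers(300_000, 400_000, n),
    'END_TIME': rng.integers(300_000, 400_000, n),
    'C': rng.integers(10, 99, n),
})
indexed = data.set_index(['END_TIME', 'START_TIME']).sort_index()
flat = indexed.reset_index()

def by_index():
    # Selection via MultiIndex level values.
    return indexed.loc[(indexed.index.get_level_values('END_TIME') > 358200)
                       & (indexed.index.get_level_values('START_TIME') < 361680)]

def by_columns():
    # Selection via regular data columns after reset_index().
    return flat[(flat.END_TIME > 358200) & (flat.START_TIME < 361680)]

print(timeit.timeit(by_index, number=1000))    # MultiIndex-based selection
print(timeit.timeit(by_columns, number=1000))  # column-based selection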