Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to query MultiIndex index columns values in pandas

Code example:

In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])  In [172]: B = np.array([111, 222, 222, 333, 333, 777])  In [173]: C = randint(10, 99, 6)  In [174]: df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])  In [175]: df.set_index(['A', 'B'], inplace=True)  In [176]: df Out[176]:            C A   B       1.1 111  20     222  31 3.3 222  24     333  65 5.5 333  22 6.6 777  74  

Now, I want to retrieve A values:
Q1: in range [3.3, 6.6] - expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] in case last inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if not.
Q2: in range [2.0, 4.0] - expected return value: [3.3] or [3.3, 3.3]

Same for any other MultiIndex dimension, for example B values:
Q3: in range [111, 500] with repetitions, as number of data rows in range - expected return value: [111, 222, 222, 333, 333]

More formal:

Let us assume T is a table with columns A, B and C. The table includes n rows. Table cells are numbers, for example A double, B and C integers. Let's create a DataFrame of table T, let us name it DF. Let's set columns A and B indexes of DF (without duplication, i.e. no separate columns A and B as indexes, and separate as data), i.e. A and B in this case MultiIndex.

Questions:

  1. How to write a query on the index, for example, to query the index A (or B), say in the labels interval [120.0, 540.0]? Labels 120.0 and 540.0 exist. I must clarify that I am interested only in the list of indices as a response to the query!
  2. How to the same, but in case of the labels 120.0 and 540.0 do not exist, but there are labels by value lower than 120, higher than 120 and less than 540, or higher than 540?
  3. In case the answer for Q1 and Q2 was unique index values, now the same, but with repetitions, as number of data rows in index range.

I know the answers to the above questions in the case of columns which are not indexes, but in the indexes case, after a long research in the web and experimentation with the functionality of pandas, I did not succeed. The only method (without additional programming) I see now is to have a duplicate of A and B as data columns in addition to index.

like image 573
Vyacheslav Shkolyar Avatar asked Jul 29 '13 09:07

Vyacheslav Shkolyar


People also ask

How do I get MultiIndex columns in pandas?

pandas MultiIndex to ColumnsUse pandas DataFrame. reset_index() function to convert/transfer MultiIndex (multi-level index) indexes to columns. The default setting for the parameter is drop=False which will keep the index values as columns and set the new index to DataFrame starting from zero.

How do I select a specific index in pandas?

Select Rows & Columns by Name or Index in Pandas DataFrame using [ ], loc & iloc. Indexing in Pandas means selecting rows and columns of data from a Dataframe.


1 Answers

To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):

In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]  In [537]: result_df Out[537]:            C A   B       3.3 222  43     333  59 5.5 333  56 

Hence, to get for example the 'A' index values, if still required:

In [538]: result_df.index.get_level_values('A') Out[538]: Index([3.3, 3.3, 5.5], dtype=object) 

The problem is, that in large data frames the performance of by index selection worse by 10% than the sorted regular rows selection. And in repetitive work, looping, the delay accumulated. See example:

In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)  In [559]: len(df) Out[559]: 12857  In [560]: df.sort(inplace=True)  In [561]: df_without_index = df.reset_index()  In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)] 1000 loops, best of 3: 562 µs per loop  In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)] 1000 loops, best of 3: 507 µs per loop 
like image 175
Vyacheslav Shkolyar Avatar answered Oct 02 '22 10:10

Vyacheslav Shkolyar