Code example:
In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])  In [172]: B = np.array([111, 222, 222, 333, 333, 777])  In [173]: C = randint(10, 99, 6)  In [174]: df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])  In [175]: df.set_index(['A', 'B'], inplace=True)  In [176]: df Out[176]:            C A   B       1.1 111  20     222  31 3.3 222  24     333  65 5.5 333  22 6.6 777  74    Now, I want to retrieve A values:
Q1: in range [3.3, 6.6] - expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] in case last inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if not.
Q2: in range [2.0, 4.0] - expected return value: [3.3] or [3.3, 3.3]  
Same for any other MultiIndex dimension, for example B values:
Q3: in range [111, 500] with repetitions, as number of data rows in range - expected return value: [111, 222, 222, 333, 333]    
More formal:
Let us assume T is a table with columns A, B and C. The table includes n rows. Table cells are numbers, for example A double, B and C integers. Let's create a DataFrame of table T, let us name it DF. Let's set columns A and B indexes of DF (without duplication, i.e. no separate columns A and B as indexes, and separate as data), i.e. A and B in this case MultiIndex.
Questions:
I know the answers to the above questions in the case of columns which are not indexes, but in the indexes case, after a long research in the web and experimentation with the functionality of pandas, I did not succeed. The only method (without additional programming) I see now is to have a duplicate of A and B as data columns in addition to index.
pandas MultiIndex to ColumnsUse pandas DataFrame. reset_index() function to convert/transfer MultiIndex (multi-level index) indexes to columns. The default setting for the parameter is drop=False which will keep the index values as columns and set the new index to DataFrame starting from zero.
Select Rows & Columns by Name or Index in Pandas DataFrame using [ ], loc & iloc. Indexing in Pandas means selecting rows and columns of data from a Dataframe.
To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):
In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]  In [537]: result_df Out[537]:            C A   B       3.3 222  43     333  59 5.5 333  56   Hence, to get for example the 'A' index values, if still required:
In [538]: result_df.index.get_level_values('A') Out[538]: Index([3.3, 3.3, 5.5], dtype=object)   The problem is, that in large data frames the performance of by index selection worse by 10% than the sorted regular rows selection. And in repetitive work, looping, the delay accumulated. See example:
In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)  In [559]: len(df) Out[559]: 12857  In [560]: df.sort(inplace=True)  In [561]: df_without_index = df.reset_index()  In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)] 1000 loops, best of 3: 562 µs per loop  In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)] 1000 loops, best of 3: 507 µs per loop 
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With