How can I properly use a Pandas Dataframe with a multiindex that includes Intervals?

Tags:

python

pandas

dataframe

I'm trying to slice into a DataFrame that has a MultiIndex composed of an IntervalIndex and a regular Index. Example code:

import pandas as pd
from pandas import Interval as ntv

df = pd.DataFrame.from_records([
   {'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1}, 
   {'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))

Looks like this:

            E  var1
ntv     id
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5

What I would like to do is slice into the DataFrame at a specific value and return all rows whose interval contains that value. Ex:

df.loc[4]

should return (trivially)

    E  var1
id
1   1   0.1
2   0   0.5

The problem is I keep getting a TypeError about the index, and the docs show a similar operation (but on a single-level index) that does produce what I'm looking for.

TypeError: only integer scalar arrays can be converted to a scalar index

I've tried many things, but nothing seems to work. I could move the id column into the DataFrame itself, but I'd rather keep my index unique, and I would constantly be calling set_index('id').

I feel like either a) I'm missing something about MultiIndexes or b) there is a bug / ambiguity with using an IntervalIndex in a MultiIndex.

asked Dec 03 '17 18:12

Cam.Davidson.Pilon

3 Answers

Since we are dealing with intervals, there is a method called get_loc that finds the rows whose interval contains a given value. To show what I mean:

import pandas as pd
from pandas import Interval as ntv

df = pd.DataFrame.from_records([
   {'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1}, 
   {'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))

df.iloc[(df.index.get_level_values(0).get_loc(4))]
            E  var1
ntv     id         
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5

df.iloc[(df.index.get_level_values(0).get_loc(11))]
            E  var1
ntv     id         
(0, 12] 2   0   0.5

This also works if you have multiple rows of data for one interval, i.e.

df = pd.DataFrame.from_records([
   {'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1}, 
   {'id': 3, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
   {'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))

df.iloc[(df.index.get_level_values(0).get_loc(4))]

            E  var1
ntv     id         
(0, 10] 1   1   0.1
        3   1   0.1
(0, 12] 2   0   0.5

If you time this against a list comprehension, the get_loc approach is considerably faster for large DataFrames, i.e.

ndf = pd.concat([df]*10000)

%%timeit
ndf.iloc[ndf.index.get_level_values(0).get_loc(4)]
10 loops, best of 3: 32.8 ms per loop

%%timeit
intervals = ndf.index.get_level_values(0)
mask = [4 in i for i in intervals]
ndf.loc[mask]
1 loop, best of 3: 193 ms per loop
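
What get_loc returns can vary with the pandas version and with how many intervals match, so here is a version-tolerant variant as a minimal sketch. It assumes pandas 0.25 or later (where IntervalIndex.contains is available) and builds the MultiIndex explicitly instead of going through from_records. The helper name rows_containing is made up for illustration.

```python
import pandas as pd

# Same data as above, with the MultiIndex built explicitly.
intervals = pd.IntervalIndex.from_tuples([(0, 10), (0, 12)])  # right-closed by default
index = pd.MultiIndex.from_arrays([intervals, [1, 2]], names=["ntv", "id"])
df = pd.DataFrame({"E": [1, 0], "var1": [0.1, 0.5]}, index=index)


def rows_containing(frame, value, level="ntv"):
    """Return the rows whose interval on `level` contains `value` (hypothetical helper)."""
    # Wrap in IntervalIndex so .contains is available even if the level
    # comes back as a plain object Index of Interval values.
    level_values = pd.IntervalIndex(frame.index.get_level_values(level))
    mask = level_values.contains(value)  # boolean array, one entry per row
    # Drop the interval level so the result is indexed by id only.
    return frame[mask].droplevel(level)


hits_4 = rows_containing(df, 4)
hits_11 = rows_containing(df, 11)
```

Because the selection is a boolean mask, the result is always a DataFrame, whether zero, one, or several intervals match.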

answered Oct 21 '22 10:10

Bharath

So I did a bit of digging to try to understand the problem. When I run your code, the following happens: pandas tries to index into the index's labels with "slice(array([0, 1], dtype=int64), array([1, 2], dtype=int64), None)"

(when I say index_type I mean the Pandas datatype)

An index_type's label is a list of indices that map to the index_type's levels array. Here is an example from the documentation.

    >>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
    >>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
    MultiIndex(levels=[[1, 2], ['blue', 'red']],
               labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
               names=['number', 'color'])

Notice how the second list in labels refers to positions in the second list of levels: levels[1][1] is equal to red, and levels[1][0] is equal to blue.
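
To make that mapping concrete, here is a small sketch that decodes the second level by hand. Note that the labels attribute shown above was renamed to codes in pandas 0.24, so this assumes a pandas version that has MultiIndex.codes.

```python
import pandas as pd

arrays = [[1, 1, 2, 2], ["red", "blue", "red", "blue"]]
mi = pd.MultiIndex.from_arrays(arrays, names=("number", "color"))

# levels holds the sorted unique values of each level;
# codes holds, for every row, a position into the matching levels list.
color_levels = list(mi.levels[1])
color_codes = list(mi.codes[1])

# Looking each code up in the levels list recovers the original values.
decoded = [color_levels[code] for code in color_codes]
```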

Anyhow, this is all to say that I don't believe IntervalIndex is meant to be used in an overlapping fashion. The original proposal for it, https://github.com/pandas-dev/pandas/issues/7640, says:

"A IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals."

My suggestion is to move the interval into a column. You could probably write up a simple function with numba to test if a number is in each interval. Do you mind explaining the way you're benefiting from the interval?
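
A rough sketch of what that could look like, using plain vectorized comparisons rather than numba. The column names left and right and the helper rows_containing are assumptions for illustration, and the intervals are treated as right-closed, as in the question.

```python
import pandas as pd

# Interval bounds stored as ordinary columns; id stays as a unique index.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "left": [0, 0],
        "right": [10, 12],
        "var1": [0.1, 0.5],
        "E": [1, 0],
    }
).set_index("id")


def rows_containing(frame, value):
    """Return rows whose (left, right] interval contains `value` (hypothetical helper)."""
    mask = (frame["left"] < value) & (value <= frame["right"])
    return frame[mask]


hits_4 = rows_containing(df, 4)
hits_11 = rows_containing(df, 11)
```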

answered Oct 21 '22 08:10

Gabriel A

Piggybacking off of @Dark's solution, Index.get_loc just calls Index.get_indexer under the hood, so it might be more efficient to call the underlying method when you don't have additional parameters and red tape.

idx = df.index.get_level_values(0)
df.iloc[idx.get_indexer([4])]

My originally proposed solution:

intervals = df.index.get_level_values(0)
mask = [4 in i for i in intervals]
df.loc[mask]

Regardless, it's certainly strange that these two return different results. It looks like it has to do with whether the index is unique, monotonic, or neither:

df.reset_index(level=1, drop=True).loc[4] # good
df.loc[4]  # TypeError
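
If get_indexer refuses to work with overlapping intervals on your pandas version, get_indexer_non_unique is the variant intended for indexes with duplicates or overlaps. This is only a minimal sketch: it builds the MultiIndex explicitly, and it filters out -1 entries, which mark targets that matched nothing.

```python
import numpy as np
import pandas as pd

intervals = pd.IntervalIndex.from_tuples([(0, 10), (0, 12)])  # overlapping on purpose
index = pd.MultiIndex.from_arrays([intervals, [1, 2]], names=["ntv", "id"])
df = pd.DataFrame({"E": [1, 0], "var1": [0.1, 0.5]}, index=index)

idx = pd.IntervalIndex(df.index.get_level_values("ntv"))

# Returns (positions of matches, positions in the target that had no match).
positions, missing = idx.get_indexer_non_unique([4])

# Drop "no match" markers and sort so rows come back in their original order.
positions = np.sort(positions[positions >= 0])
result = df.iloc[positions]
```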

answered Oct 21 '22 09:10

Brad Solomon

