
Python Pandas: cannot do slice indexing

I am trying to work with a pandas multiindex dataframe that looks like this:

                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
      3001131  3001132     G|A

I want to be able to do this:

df.loc[('chr1', slice(3000714, 3001110))]

That fails with the following error:

cannot do slice indexing on with these indexers [1204741] of

df.index.levels[1].dtype returns dtype('int64'), so it should work with integer slices, right?

Also, any comments on how to do this efficiently would be valuable, as the dataframe has 12 million rows and I need to query it with this kind of slice query ~70 million times.

Mike Dacre asked Mar 12 '23 01:03

1 Answer

I think you need to add ,: to the end. That tells loc the tuple selects rows from the MultiIndex and that you want all columns:

print (df.loc[('chr1', slice(3000714, 3001110)),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
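
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the four sample rows from the question and runs the same selection. The construction via set_index is my assumption about how such a frame could be built; the original post does not show it. As I understand the pandas docs, a bare tuple passed to loc can be read as (row indexer, column indexer), which is why naming both axes removes the ambiguity.

```python
import pandas as pd

# Rebuild the sample rows shown in the question (values copied from the post).
df = pd.DataFrame(
    {
        "chrom": ["chr1"] * 4,
        "start": [3000714, 3001065, 3001110, 3001131],
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    }
).set_index(["chrom", "start"]).sort_index()

# The trailing ", :" marks the tuple as a row indexer into the MultiIndex.
# Label slices are inclusive on both ends, so 3001110 is part of the result.
subset = df.loc[("chr1", slice(3000714, 3001110)), :]
print(subset)
```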

Another solution is to pass axis=0 to loc:

print (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
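
As a quick sanity check, the two spellings can be compared directly. This is only a sketch on the four sample rows from the question (the frame construction is my assumption), not on the real 12-million-row data.

```python
import pandas as pd

# Sample rows from the question, indexed by (chrom, start).
df = pd.DataFrame(
    {
        "chrom": ["chr1"] * 4,
        "start": [3000714, 3001065, 3001110, 3001131],
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    }
).set_index(["chrom", "start"]).sort_index()

# Form 1: name both axes explicitly.
a = df.loc[("chr1", slice(3000714, 3001110)), :]

# Form 2: tell loc up front that the key applies to the rows only.
b = df.loc(axis=0)[("chr1", slice(3000714, 3001110))]

# DataFrame.equals compares index, columns and values.
print(a.equals(b))
```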

But if you need only the rows for 3000714 and 3001110, pass a list instead of a slice:

print (df.loc[('chr1', [3000714, 3001110]),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C

Or the same selection written with pd.IndexSlice:

idx = pd.IndexSlice
print (df.loc[idx['chr1', [3000714, 3001110]],:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
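
pd.IndexSlice also accepts the colon syntax, which may read better than slice(...) for the range query from the question. A small sketch, again on the sample rows only (the frame construction is my assumption):

```python
import pandas as pd

# Sample rows from the question, indexed by (chrom, start).
df = pd.DataFrame(
    {
        "chrom": ["chr1"] * 4,
        "start": [3000714, 3001065, 3001110, 3001131],
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    }
).set_index(["chrom", "start"]).sort_index()

idx = pd.IndexSlice

# IndexSlice just returns whatever is put in the brackets, so this key is
# the plain tuple ('chr1', slice(3000714, 3001110, None)).
key = idx["chr1", 3000714:3001110]
same_key = key == ("chr1", slice(3000714, 3001110, None))

# Same range selection as with the explicit slice(...) object.
ranged = df.loc[key, :]
print(ranged)
```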

Timings:

In [21]: %timeit (df.loc[('chr1', slice(3000714, 3001110)),:])
1000 loops, best of 3: 757 µs per loop

In [22]: %timeit (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
1000 loops, best of 3: 743 µs per loop

In [23]: %timeit (df.loc[('chr1', [3000714, 3001110]),:])
1000 loops, best of 3: 824 µs per loop

In [24]: %timeit (df.loc[pd.IndexSlice['chr1', [3000714, 3001110]],:])
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 826 µs per loop
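
For the ~70 million queries mentioned in the question, one option that might be worth benchmarking is to bypass loc and binary-search the start positions directly with NumPy. This is not something I timed above; it is only a sketch, and it assumes the index is sorted so that the start values within each chromosome are ascending. The names by_chrom and query are hypothetical helpers, not pandas API.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, indexed by (chrom, start) and sorted.
df = pd.DataFrame(
    {
        "chrom": ["chr1"] * 4,
        "start": [3000714, 3001065, 3001110, 3001131],
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    }
).set_index(["chrom", "start"]).sort_index()

# Hypothetical lookup structure, built once:
# chromosome -> (sorted array of start positions, sub-frame for that chromosome).
by_chrom = {
    chrom: (sub.index.get_level_values("start").to_numpy(), sub)
    for chrom, sub in df.groupby(level="chrom")
}

def query(chrom, lo, hi):
    """Return the rows of `chrom` with lo <= start <= hi (both ends inclusive)."""
    starts, sub = by_chrom[chrom]
    left = np.searchsorted(starts, lo, side="left")    # first position >= lo
    right = np.searchsorted(starts, hi, side="right")  # one past last position <= hi
    return sub.iloc[left:right]

result = query("chr1", 3000714, 3001110)
print(result)
```

Whether this actually beats loc on the real data would need to be measured with %timeit on the full frame.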
jezrael answered Mar 27 '23 08:03