I am trying to work with a pandas MultiIndex DataFrame that looks like this:
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
      3001131  3001132     G|A
I want to be able to do this:
df.loc[('chr1', slice(3000714, 3001110))]
That fails with the following error:
cannot do slice indexing on with these indexers [1204741] of
df.index.levels[1].dtype returns dtype('int64'), so it should work with integer slices, right?
Also, any comments on how to do this efficiently would be valuable, as the dataframe has 12 million rows and I need to query it with this kind of slice query ~70 million times.
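I think you need to add ,: to the end of the indexer. Without it, loc reads the tuple as (rows, columns) and tries to apply the slice to the columns. The trailing ,: says that the tuple selects rows and that you want all columns: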
print (df.loc[('chr1', slice(3000714, 3001110)),:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
Another solution is to add axis=0 to loc:
print (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
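For anyone who wants to reproduce the two forms above, here is a minimal sketch. It rebuilds the four sample rows by hand (the construction via set_index is my assumption; the question does not show how the frame was created) and checks that both spellings return the same rows.

```python
import pandas as pd

# Rebuild the sample rows from the question; the values are copied from the post,
# the way the frame is constructed is an assumption.
df = (
    pd.DataFrame(
        {
            "chrom": ["chr1"] * 4,
            "start": [3000714, 3001065, 3001110, 3001131],
            "end": [3000715, 3001066, 3001111, 3001132],
            "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
        }
    )
    .set_index(["chrom", "start"])
    .sort_index()  # slicing a MultiIndex by label needs a sorted index
)

# Trailing ",:" marks the tuple as the row indexer and selects all columns.
by_colon = df.loc[("chr1", slice(3000714, 3001110)), :]

# Same selection with the axis fixed up front.
by_axis = df.loc(axis=0)[("chr1", slice(3000714, 3001110))]

print(by_colon.equals(by_axis))
```

Note that label slices are inclusive on both ends, so the row at 3001110 is part of the result.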
But if you need only the rows for 3000714 and 3001110, pass a list instead of a slice:
print (df.loc[('chr1', [3000714, 3001110]),:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
Or the same selection with pd.IndexSlice:
idx = pd.IndexSlice
print (df.loc[idx['chr1', [3000714, 3001110]],:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
Timings:
In [21]: %timeit (df.loc[('chr1', slice(3000714, 3001110)),:])
1000 loops, best of 3: 757 µs per loop
In [22]: %timeit (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
1000 loops, best of 3: 743 µs per loop
In [23]: %timeit (df.loc[('chr1', [3000714, 3001110]),:])
1000 loops, best of 3: 824 µs per loop
In [24]: %timeit (df.loc[pd.IndexSlice['chr1', [3000714, 3001110]],:])
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 826 µs per loop
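On the efficiency part of the question: at roughly 0.75 ms per call, 70 million loc queries would take on the order of 14 hours. One possible alternative, which is not part of the timings above and is only a sketch, is to keep the sorted start positions of each chromosome in a NumPy array and resolve many range queries at once with np.searchsorted. The query arrays q_lo and q_hi below are hypothetical.

```python
import numpy as np

# Sorted start positions for one chromosome (sample values from the question).
starts = np.array([3000714, 3001065, 3001110, 3001131])

# Hypothetical batch of inclusive [lo, hi] range queries.
q_lo = np.array([3000714, 3001000, 3001131])
q_hi = np.array([3001110, 3001120, 3001131])

# side="left": position of the first row with start >= lo.
first = np.searchsorted(starts, q_lo, side="left")
# side="right": one past the last row with start <= hi.
stop = np.searchsorted(starts, q_hi, side="right")

# Rows matching query i sit at positions first[i]:stop[i].
counts = stop - first
print(first, stop, counts)
```

The positions can then be used with iloc on a per-chromosome frame sorted by start, for example sub.iloc[first[i]:stop[i]], where sub is an assumed name for that frame. Whether this is fast enough for your data is something you would need to time yourself.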