How can I drop/remove multiple ranges of rows from a Pandas DataFrame with a (two level) multi index looking like this:
idx1 idx2 | value(s) ...
------------------------------------------
4 0 | 1.123456 ...
1 | 2.234567 ...
2 | 0.012345 ...
8 0 | -1.123456 ...
1 | -0.973915 ...
2 | 1.285553 ...
3 | -0.194625 ...
4 | -0.144112 ...
... ... | ... ...
The ranges to drop/remove are currently residing in a list like this:
ranges = [[(4, 1), (4, 2)], # range (4,1):(4,2)
[(8, 0), (8, 3)], # range (8,0):(8,3)
[(8, 5), (8, 10)], ...] # range (8,5):(8,10)
The main problem is, that most methods I found, don't support either multi indexing or slicing or multiple slices/ranges.
What is the best/fastest way to do so.
for range in ranges:
df = df.drop(df.loc[range[0]:range[1]].index)
Is slow and ugly, but is the only workable solution I found, combining multi indexing, slicing and in a way multiple ranges (by looping over the ranges).
All three proposed solutions work. They all solve the issue by transforming the list of slices into a list of all individual tuples within those slice.
The fastest way to do so is @ALollz solution:
idx = [(x, z) for (x, i), (_, j) in ranges for z in np.arange(i,j+1,1)]
Regarding the removal of the rows, all solutions work, but there's a big difference in performance (all performance data based on my data set with ~10 Mio. entries)
@ALollz + @Ben. T's combined solution (~19 sec.)
df.drop(pd.MultiIndex.from_tuples(idx))
or without creating a MultiIndex
object
df.drop(idx)
@ALollz first solution (~75 sec.)
df.loc[list(set(df.index.values) - set(idx))]
@user3471881's solution (~95 sec.)
df.loc[~df.index.isin(ranges)]
my ugly solution (~350 sec.)
see above
You could create a new list of indices and as Ben.T points out, just drop them.
import numpy as np
import pandas as pd
idx = [(x, z) for (x, i), (_, j) in ranges for z in np.arange(i,j+1,1)]
df.drop(pd.MultiIndex.from_tuples(idx))
value(s)
idx1 idx2
4 0 4
8 4 11
The range list you are using forces us to use multiple slices, which could be fine but doesn't seem to be what you want.
If you instead fill your list with all indexes that you want removed (you said in a comment that you could do this):
ranges = [(4, 1), (4, 2), (8, 0), (8, 1), (8, 2), (8, 3) ... ]
You could just access the index
of the DataFrame
and check if it isin()
your list of tuples.
df.index.isin(ranges)
To remove the indexes that are in your list of ranges, add a tilde and then use as mask.
df[~df.index.isin(ranges)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With