Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove many index ranges from Pandas DataFrame

Tags:

python

pandas

Question + MWE

How can I drop/remove multiple ranges of rows from a Pandas DataFrame with a (two level) multi index looking like this:

idx1    idx2  |  value(s)   ...
------------------------------------------
4       0     |  1.123456   ...
        1     |  2.234567   ...
        2     |  0.012345   ...
8       0     | -1.123456   ...
        1     | -0.973915   ...
        2     |  1.285553   ...
        3     | -0.194625   ...
        4     | -0.144112   ...
...     ...   | ...         ...

The ranges to drop/remove are currently residing in a list like this:

ranges = [[(4, 1), (4, 2)],          # range (4,1):(4,2)
          [(8, 0), (8, 3)],          # range (8,0):(8,3)
          [(8, 5), (8, 10)], ...]    # range (8,5):(8,10)

The main problem is, that most methods I found, don't support either multi indexing or slicing or multiple slices/ranges.

What is the best/fastest way to do so.

Current ugly solution

for range in ranges:
    df = df.drop(df.loc[range[0]:range[1]].index)

Is slow and ugly, but is the only workable solution I found, combining multi indexing, slicing and in a way multiple ranges (by looping over the ranges).

Solution comparison

All three proposed solutions work. They all solve the issue by transforming the list of slices into a list of all individual tuples within those slice.

Slices to complete set of tuples

The fastest way to do so is @ALollz solution:

idx = [(x, z) for (x, i), (_, j) in ranges for z in np.arange(i,j+1,1)]

Performance

Regarding the removal of the rows, all solutions work, but there's a big difference in performance (all performance data based on my data set with ~10 Mio. entries)

  1. @ALollz + @Ben. T's combined solution (~19 sec.)

    df.drop(pd.MultiIndex.from_tuples(idx))
    

    or without creating a MultiIndex object

    df.drop(idx)
    
  2. @ALollz first solution (~75 sec.)

    df.loc[list(set(df.index.values) - set(idx))]
    
  3. @user3471881's solution (~95 sec.)

    df.loc[~df.index.isin(ranges)]
    
  4. my ugly solution (~350 sec.)

    see above
    
like image 262
Ichixgo Avatar asked Oct 17 '22 10:10

Ichixgo


2 Answers

You could create a new list of indices and as Ben.T points out, just drop them.

import numpy as np
import pandas as pd

idx = [(x, z) for (x, i), (_, j) in ranges for z in np.arange(i,j+1,1)]
df.drop(pd.MultiIndex.from_tuples(idx))

Output:

           value(s)
idx1 idx2          
4    0            4
8    4           11
like image 168
ALollz Avatar answered Oct 22 '22 11:10

ALollz


The range list you are using forces us to use multiple slices, which could be fine but doesn't seem to be what you want.

If you instead fill your list with all indexes that you want removed (you said in a comment that you could do this):

ranges = [(4, 1), (4, 2), (8, 0), (8, 1), (8, 2), (8, 3) ... ]

You could just access the index of the DataFrame and check if it isin() your list of tuples.

df.index.isin(ranges)

To remove the indexes that are in your list of ranges, add a tilde and then use as mask.

df[~df.index.isin(ranges)]
like image 33
user3471881 Avatar answered Oct 22 '22 09:10

user3471881