Filtering rows from dataframe based on the values of the previous rows

I have a dataframe like the following:

    A
1   1000
2   1000
3   1001
4   1001
5   10
6   1000
7   1010
8   9
9   10
10  6
11  999
12  10110
13  10111
14  1000 

I am trying to clean my dataframe as follows: drop every row whose value is more than 1.5 times, or less than 0.5 times, the value of the previous row. If the previous row is itself a to-drop row, the comparison must be made against the nearest preceding row that is NOT dropped (this applies, for example, to indexes 9, 10 and 13 in my dataframe). The final dataframe should look like this:

    A
1   1000
2   1000
3   1001
4   1001
6   1000
7   1010
11  999
14  1000

My dataframe is really huge, so performance matters.

Alfonso_MA asked Nov 20 '19 16:11

2 Answers

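You can't get away from looping through each row: whether a row is kept depends on the last kept row, which in turn depends on every decision made before it.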

Tips
  • Avoid creating new, expensive objects for each row
  • Use a memory efficient iteration

I'd use a generator

I'll pass a series to a function and yield the index values of the rows that satisfy the condition.

def f(s):
    it = s.items()                   # `iteritems` was removed in pandas 2.0
    i, v = next(it)
    yield i                          # Yield the first one
    for j, x in it:
        if .5 * v <= x <= 1.5 * v:
            yield j                  # Yield the ones that satisfy
            v = x                    # Update the comparative value

df.loc[list(f(df.A))]                # Use `loc` with index values
                                     # yielded by my generator

       A
1   1000
2   1000
3   1001
4   1001
6   1000
7   1010
11   999
14  1000
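
In the spirit of the tips above, here is a minimal sketch of the same loop run over a plain Python list instead of a Series. It may reduce per-row overhead on a large frame, though that is not benchmarked here. The names `kept_positions`, `low` and `high` are hypothetical, and the function returns integer positions (for `iloc`) rather than index labels.

```python
def kept_positions(values, low=0.5, high=1.5):
    """Return positions of the values that survive the rolling comparison.

    `kept_positions`, `low` and `high` are illustrative names, not pandas API.
    """
    keep = []
    last = None                      # last value that was kept
    for pos, x in enumerate(values):
        # The first value is always kept; later ones are compared
        # against the most recent kept value only.
        if last is None or low * last <= x <= high * last:
            keep.append(pos)
            last = x
    return keep


# Column A from the question, as a plain list
values = [1000, 1000, 1001, 1001, 10, 1000, 1010,
          9, 10, 6, 999, 10110, 10111, 1000]

positions = kept_positions(values)
print(positions)                     # [0, 1, 2, 3, 5, 6, 10, 13]
```

With the dataframe from the question this would be used as `df.iloc[kept_positions(df.A.tolist())]`; positions 0, 1, 2, 3, 5, 6, 10, 13 correspond to index labels 1, 2, 3, 4, 6, 7, 11, 14.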
piRSquared answered Oct 02 '22 23:10



One alternative is to use itertools.accumulate to carry forward the last valid value, and then keep only the rows whose original value equals the carried value, e.g.:

from itertools import accumulate


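def change(x, y, pct=0.5):
    # keep y if it lies within +/- pct of the last valid value x
    if (1 - pct) * x <= y <= (1 + pct) * x:
        return y
    return x  # otherwise carry the last valid value forward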

# build a mask that is True where the carried-forward value equals the original A
mask = (df.A == list(accumulate(df.A, change)))

print(df[mask])

Output

       A
1   1000
2   1000
3   1001
4   1001
6   1000
7   1010
11   999
14  1000

To get an idea of how this works, compare the accumulated column (change) with the original side by side:

        A  change
1    1000    1000
2    1000    1000
3    1001    1001
4    1001    1001
5      10    1001
6    1000    1000
7    1010    1010
8       9    1010
9      10    1010
10      6    1010
11    999     999
12  10110     999
13  10111     999
14   1000    1000
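
The side-by-side table can be reproduced without pandas, which may help when checking the logic in isolation. This is a rough sketch on a plain list; `carry_forward`, `low` and `high` are hypothetical names standing in for the function and thresholds used above.

```python
from itertools import accumulate


def carry_forward(last_valid, current, low=0.5, high=1.5):
    # Accept `current` if it is within [low, high] times the last valid
    # value; otherwise keep pushing the last valid value forward.
    if low * last_valid <= current <= high * last_valid:
        return current
    return last_valid


# Column A from the question, as a plain list
a = [1000, 1000, 1001, 1001, 10, 1000, 1010,
     9, 10, 6, 999, 10110, 10111, 1000]

# accumulate emits the first value unchanged, then folds pairwise
carried = list(accumulate(a, carry_forward))

# A row survives exactly when its own value is the carried value
kept_labels = [label
               for label, (orig, last) in enumerate(zip(a, carried), start=1)
               if orig == last]

for label, (orig, last) in enumerate(zip(a, carried), start=1):
    print(f"{label:>2} {orig:>6} {last:>6}")
```

Rows 5, 8, 9, 10, 12 and 13 are the ones where the two columns differ, so they are the rows the mask removes.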

Update

To set pct at the call site, wrap change in a lambda:

mask = (df.A == list(accumulate(df.A, lambda x, y : change(x, y, pct=0.5))))
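
As a possible alternative to the lambda, `functools.partial` can bind the threshold. The sketch below uses a plain list and a hypothetical function `change_symmetric`, which assumes that pct is meant as a symmetric tolerance around the last valid value (an assumption, not something stated in the question).

```python
from functools import partial
from itertools import accumulate


def change_symmetric(x, y, pct=0.5):
    # Assumption: pct is a symmetric tolerance, so y is accepted when it
    # lies within [(1 - pct) * x, (1 + pct) * x].
    if (1 - pct) * x <= y <= (1 + pct) * x:
        return y
    return x


def kept_labels(values, pct):
    # partial fixes `pct`; accumulate still calls the function as f(x, y)
    carried = accumulate(values, partial(change_symmetric, pct=pct))
    return [label
            for label, (orig, last) in enumerate(zip(values, carried), start=1)
            if orig == last]


# Column A from the question, as a plain list
a = [1000, 1000, 1001, 1001, 10, 1000, 1010,
     9, 10, 6, 999, 10110, 10111, 1000]

print(kept_labels(a, pct=0.5))    # [1, 2, 3, 4, 6, 7, 11, 14]
print(kept_labels(a, pct=0.005))  # [1, 2, 3, 4, 6, 11, 14]
```

With the tighter tolerance of 0.005, row 7 (1010) is also dropped, because it is more than 0.5% above the last valid value of 1000.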
Dani Mesejo answered Oct 02 '22 23:10
