
Filtering a Pandas DataFrame Without Removing Rows

I'm trying to use where on my Pandas DataFrame to replace all cells that don't meet my criteria with NaN. However, I'd like to do it in a way that always preserves the shape of my original DataFrame and does not remove any rows from the result.

Given the following DataFrame:

      A    B    C    D
1/1   0    1    0    1
1/2   2    1    1    1
1/3   3    0    1    0 
1/4   1    0    1    2
1/5   1    0    1    1
1/6   2    0    2    1
1/7   3    5    2    3
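
For reference, here is a minimal reconstruction of that frame (the 1/1 ... 1/7 labels are just plain string index labels):

import pandas as pd

df = pd.DataFrame(
    {"A": [0, 2, 3, 1, 1, 2, 3],
     "B": [1, 1, 0, 0, 0, 0, 5],
     "C": [0, 1, 1, 1, 1, 2, 2],
     "D": [1, 1, 0, 2, 1, 1, 3]},
    index=["1/1", "1/2", "1/3", "1/4", "1/5", "1/6", "1/7"],
)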

I would like to search the DataFrame for all cells that meet a certain condition, but only where column D ALSO meets a condition of its own. In this case my criterion is:

Find all cells that are greater than the previous row's value, when column D is also > 1

I accomplish this by using the following syntax:

matches = df[df > df.shift(1)]
matches = matches[df.D > 1]

I have to split this query into two statements because df.D is a Series and does not match the shape of the entire DataFrame. According to this question I asked previously, support for a broadcasting & operator will not be available until 0.14.

The problem I am having is that after I run the second statement, the shape of the resulting DataFrame changes and rows are removed (the number of columns stays the same), whereas the first statement on its own leaves the original number of rows.

Why does the second statement remove rows while the first does not? How can I achieve the same result while keeping the full number of rows intact?
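
To make the difference concrete, checking the shapes (using a frame built like the one above) shows what I mean:

matches = df[df > df.shift(1)]
print(matches.shape)                  # (7, 4) -- shape preserved, non-matching cells become NaN

matches = matches[df.D > 1]
print(matches.shape)                  # (2, 4) -- only the rows where D > 1 survive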

Edit:

The pandas documentation states that in order to guarantee that the shape is preserved, I should use the where method rather than boolean indexing. However, that does not seem to work for my second statement:

matches.where(df.D > 1)

This gives me the following error:

ValueError: Array conditional must be same shape as self

asked Oct 22 '13 by mclark1129


2 Answers

This is slightly more intuitive than @DSM's answer (but pandas is missing this type of auto-broadcasting on boolean ops at the moment):

In [58]: df.where((df>df.shift(1)).values & DataFrame(df.D==1).values)
Out[58]: 
      A   B   C   D
1/1 NaN NaN NaN NaN
1/2   2 NaN   1 NaN
1/3 NaN NaN NaN NaN
1/4 NaN NaN NaN NaN
1/5 NaN NaN NaN NaN
1/6   2 NaN   2 NaN
1/7 NaN NaN NaN NaN

See here for the issue; it is slated to be addressed in 0.14.
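
One small caveat, not part of the original answer: the snippet assumes DataFrame was imported into the namespace (e.g. from pandas import DataFrame). With the usual import pandas as pd convention the same line would read:

df.where((df > df.shift(1)).values & pd.DataFrame(df.D == 1).values)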

answered by Jeff


If I understand what you're after, you can do the broadcasting manually by dropping down to the numpy level:

>>> (df > df.shift(1)).values & (df.D == 1)[:,None]
array([[False, False, False, False],
       [ True, False,  True, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False,  True, False],
       [False, False, False, False]], dtype=bool)

after which you can use where:

>>> df.where((df > df.shift(1)).values & (df.D == 1)[:,None], np.nan)
      A   B   C   D
1/1 NaN NaN NaN NaN
1/2   2 NaN   1 NaN
1/3 NaN NaN NaN NaN
1/4 NaN NaN NaN NaN
1/5 NaN NaN NaN NaN
1/6   2 NaN   2 NaN
1/7 NaN NaN NaN NaN
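
A note for newer pandas versions, not from the original answers: a Series is no longer a numpy ndarray subclass, so indexing it with [:, None] as above was deprecated and later removed. The same manual broadcasting still works if both masks are converted to numpy arrays first (a sketch, here also using the > 1 criterion stated in the question rather than == 1):

cell_mask = (df > df.shift(1)).to_numpy()      # (7, 4) boolean array
row_mask = (df.D > 1).to_numpy()[:, None]      # (7, 1) column vector, broadcasts across columns

result = df.where(cell_mask & row_mask)        # same shape as df, non-matches become NaN

Because cell_mask & row_mask broadcasts to the full (7, 4) shape, where no longer complains that the condition's shape differs from the frame's.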
answered by DSM