I'm trying to use where on my Pandas DataFrame to replace all cells that don't meet my criteria with NaN. However, I'd like to do it in a way that always preserves the shape of my original DataFrame and does not remove any rows from the result.
Given the following DataFrame:
A B C D
1/1 0 1 0 1
1/2 2 1 1 1
1/3 3 0 1 0
1/4 1 0 1 2
1/5 1 0 1 1
1/6 2 0 2 1
1/7 3 5 2 3
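(For reproducibility, the frame above can be built with something like the following; I'm assuming the row labels are plain strings rather than parsed dates.)
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 2, 3, 1, 1, 2, 3],
     "B": [1, 1, 0, 0, 0, 0, 5],
     "C": [0, 1, 1, 1, 1, 2, 2],
     "D": [1, 1, 0, 2, 1, 1, 3]},
    index=["1/1", "1/2", "1/3", "1/4", "1/5", "1/6", "1/7"],
)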
I would like to search the DataFrame for all cells that meet a certain criterion when column D also meets a particular criterion. In this case my criterion is:
Find all cells that are greater than the previous value, when column D is also > 1
I accomplish this by using the following syntax:
matches = df[df > df.shift(1)]
matches = matches[df.D > 1]
I have to split this query into two statements because df.D is a Series and does not match the shape of the entire DataFrame. According to this question I asked previously, support for a broadcasting & operator will not be available until 0.14.
The problem I am having is that after I run the second statement, the shape of the resulting DataFrame changes and rows are removed; the number of columns stays the same. The first statement leaves the original number of rows in place.
Why does the second statement remove rows while the first does not? How can I achieve the same result while keeping the full number of rows intact?
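For illustration, using the sample frame above the shapes work out to:
df[df > df.shift(1)].shape            # (7, 4) -- the DataFrame mask only turns failing cells into NaN
df[df > df.shift(1)][df.D > 1].shape  # (2, 4) -- indexing with a boolean Series drops whole rows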
Edit:
The pandas documentation states that in order to guarantee the shape is preserved, I should use the where method rather than boolean indexing. However, that does not seem to work for my second statement:
matches.where(df.D > 1)
Gives me the following error:
ValueError: Array conditional must be same shape as self
You can use df[df["Courses"] == 'Spark'] to filter rows by a condition in pandas DataFrame. Not that this expression returns a new DataFrame with selected rows. You can also write the above statement with a variable.
Often, you want to find instances of a specific value in your DataFrame. You can easily filter rows based on whether they contain a value or not using the . loc indexing method.
To filter rows based on dates, first format the dates in the DataFrame to datetime64 type. Then use the DataFrame. loc[] and DataFrame. query[] function from the Pandas package to specify a filter condition.
This is slightly more intuitive than @DSM's answer (but pandas is missing this type of auto-broadcasting on boolean ops at the moment):
In [58]: df.where((df>df.shift(1)).values & DataFrame(df.D==1).values)
Out[58]:
A B C D
1/1 NaN NaN NaN NaN
1/2 2 NaN 1 NaN
1/3 NaN NaN NaN NaN
1/4 NaN NaN NaN NaN
1/5 NaN NaN NaN NaN
1/6 2 NaN 2 NaN
1/7 NaN NaN NaN NaN
See here for the issue to be addressed in 0.14.
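Unlike the chained boolean indexing in the question, this keeps every row of the original frame; for example:
matches = df.where((df > df.shift(1)).values & DataFrame(df.D == 1).values)
matches.shape == df.shape   # True -- where() only masks cells with NaN, it never drops rows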
If I understand what you're after, you can do the broadcasting manually by dropping down to the numpy level:
>>> (df > df.shift(1)).values & (df.D == 1)[:,None]
array([[False, False, False, False],
[ True, False, True, False],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, False, True, False],
[False, False, False, False]], dtype=bool)
after which you can use where:
>>> df.where((df > df.shift(1)).values & (df.D == 1)[:,None], np.nan)
A B C D
1/1 NaN NaN NaN NaN
1/2 2 NaN 1 NaN
1/3 NaN NaN NaN NaN
1/4 NaN NaN NaN NaN
1/5 NaN NaN NaN NaN
1/6 2 NaN 2 NaN
1/7 NaN NaN NaN NaN
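Note that on recent pandas versions, multi-dimensional indexing on a Series ((df.D == 1)[:, None]) raises an error, so the same idea would be spelled by going through NumPy explicitly (a sketch assuming pandas >= 0.24 for .to_numpy()):
cond = (df > df.shift(1)).to_numpy() & (df["D"] == 1).to_numpy()[:, None]
df.where(cond)   # same result as above; the shape of df is preserved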