Let's assume I have a DataFrame with one single column of data. For example:
np.random.random_integers(0,2,1000)
data = np.cumsum(np.random.random_integers(0,2,1000))
idx = pd.date_range('1-1-2001', freq='D', periods=1000)
df = pd.DataFrame(data, idx)
Instead of working with the complete DataFrame I want to return only those rows which differ from the previous row.
Hence, this
2001-01-20 21
2001-01-21 21
2001-01-22 21
2001-01-23 23
2001-01-24 24
2001-01-25 24
would result in this
2001-01-20 21
2001-01-23 23
2001-01-24 24
Right now I would do it this way
dff = df.diff() # Compute another Series with the differences
dff.ix[0, ] = df.ix[0, ] # Instead of NAN for the row use first row of df
df['diff'] = dff # Add as column in df
df = df[df['diff'] >= 1] # Filter out
df = df.ix[:, 0:-1] # Drop additional column
This seems awfully complicated. I feel like I am missing something. Any ideas how to make it more pythonic and panda-esque?
You could compare the previous and current rows using .shift()
and then index the DataFrame using the corresponding boolean Series:
df.loc[df['a'] != df['a'].shift()]
(I've assumed that your column is called 'a').
.shift()
just moves the values in a column/Series up or down by a specified number of places (the default is 1 down).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With