
Pandas "diff()" with string

Tags: python, pandas

How can I flag a row in a dataframe every time a column changes its string value?

Ex:

Input

ColumnA   ColumnB
1         Blue
2         Blue
3         Red
4         Red
5         Yellow

# diff won't work here with strings.... only works in numerical values
dataframe['changed'] = dataframe['ColumnB'].diff()

ColumnA   ColumnB   changed
1         Blue      0
2         Blue      0
3         Red       1
4         Red       0
5         Yellow    1
guilhermecgs asked Oct 31 '16 18:10





2 Answers

Use .shift and compare:

# != flags rows whose value differs from the previous row; astype(int) gives 0/1
dataframe['changed'] = (dataframe['ColumnB'] != dataframe['ColumnB'].shift(1).fillna(dataframe['ColumnB'])).astype(int)
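
As a minimal sketch of the shift-and-compare idea on the sample data from the question: the frame construction below is an assumption reconstructed from the question's table, and != is used so that a change is flagged as 1 and an unchanged row as 0.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed layout).
dataframe = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# Previous row's value; the first row has no predecessor, so fill it
# with its own value to avoid flagging it as a change.
previous = dataframe["ColumnB"].shift(1).fillna(dataframe["ColumnB"])

# 1 where the value differs from the previous row, 0 otherwise.
dataframe["changed"] = (dataframe["ColumnB"] != previous).astype(int)

print(dataframe["changed"].tolist())  # [0, 0, 1, 0, 1]
```

This matches the "changed" column the question asks for.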
Kartik answered Sep 18 '22 04:09



I get better performance with the ne method than with the != operator:

df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int) 
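
One detail worth checking is how the first row is treated. A small sketch, assuming the five-row sample from the question: with .shift().bfill() the first row is compared against itself and comes out 0, while a bare .shift() compares it against a missing value and flags it as 1.

```python
import pandas as pd

# Five-row sample from the question (assumed here for illustration).
df = pd.DataFrame({"ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"]})

# Back-fill the gap left by shift(): the first row is compared with itself.
with_bfill = df["ColumnB"].ne(df["ColumnB"].shift().bfill()).astype(int)

# No fill: the first row is compared with a missing value, which counts as "not equal".
without_fill = df["ColumnB"].ne(df["ColumnB"].shift()).astype(int)

print(with_bfill.tolist())    # [0, 0, 1, 0, 1]
print(without_fill.tolist())  # [1, 0, 1, 0, 1]
```

So the two variants only differ in the first row; which one is right depends on whether the first row should count as a change.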

Timings

Using the following setup to produce a larger dataframe:

df = pd.concat([df]*10**5, ignore_index=True)  

I get the following timings:

%timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
10 loops, best of 3: 38.1 ms per loop

%timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
10 loops, best of 3: 77.7 ms per loop

%timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
10 loops, best of 3: 99.6 ms per loop

%timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
10 loops, best of 3: 19.3 ms per loop
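
To rerun a comparison like this outside IPython, here is a rough sketch using the standard timeit module. The starting five-row frame, the labels, and the loop counts (3 calls per repeat rather than 10, to keep it quick) are assumptions for illustration; absolute numbers will vary with machine and pandas version.

```python
import timeit
import pandas as pd

# Assumed starting frame (the five rows from the question), repeated to 500,000 rows.
base = pd.DataFrame({"ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"]})
df = pd.concat([base] * 10**5, ignore_index=True)

# Hypothetical labels mapped to the expressions being compared.
candidates = {
    "ne + shift + bfill": lambda: df["ColumnB"].ne(df["ColumnB"].shift().bfill()).astype(int),
    "!= + shift": lambda: (df["ColumnB"] != df["ColumnB"].shift()).astype(int),
    "ne + shift": lambda: df["ColumnB"].ne(df["ColumnB"].shift()).astype(int),
}

for label, func in candidates.items():
    # Best of 3 repeats, 3 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{label}: {best * 1e3:.1f} ms per loop")
```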
root answered Sep 18 '22 04:09
