In [2]: import pandas as pd
   ...:
   ...: # Original DataSet
   ...: d = {'A': [1,1,1,1,2,2,2,2,3],
   ...:      'B': ['a','a','a','x','b','b','b','x','c'],
   ...:      'C': [11,22,33,44,55,66,77,88,99],}
   ...:
   ...: df = pd.DataFrame(d)
   ...: df
Out[2]:
   A  B   C
0  1  a  11
1  1  a  22
2  1  a  33
3  1  x  44
4  2  b  55
5  2  b  66
6  2  b  77
7  2  x  88
8  3  c  99
Given a dataframe, I would like a flexible, efficient way to reset specific values based on certain conditions in two columns.
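Conditions: wherever column B is 'x', the value in column C is reset to the value of C in the next row. Desired outcome: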
Out[3]:
   A  B   C
0  1  a  11
1  1  a  22
2  1  a  33
3  1  x  55
4  2  b  55
5  2  b  66
6  2  b  77
7  2  x  99
8  3  c  99
I learned I can accomplish this using iterrows() (see below),
# Code that produces the above outcome
for idx, x_row in df[df['B'] == 'x'].iterrows():
    df.loc[idx, 'C'] = df.loc[idx+1, 'C']
df
but I need to do this many times, and I understand iterrows() is slow. Are there better pandas-y, broadcasting-like ways of getting the desired outcome more efficiently?
Vectorization is usually the best choice. pandas provides the df.values attribute (and the df.to_numpy() method) to convert the data frame to a NumPy array, which can be iterated much like a list of lists. In one test it took 14 seconds to iterate through a data frame with 10 million records this way, around 56 times faster than iterrows().
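As a rough sketch of what a vectorized version of the question's reset could look like on NumPy arrays: the names b, c, next_c, result_numpy and df_numpy are made up for illustration, and the choice to leave a flagged last row unchanged is an assumption, since the question does not cover that case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 2, 3],
    'B': ['a', 'a', 'a', 'x', 'b', 'b', 'b', 'x', 'c'],
    'C': [11, 22, 33, 44, 55, 66, 77, 88, 99],
})

# Pull the two columns out as NumPy arrays; no per-row Series objects are built.
b = df['B'].to_numpy()
c = df['C'].to_numpy()

# Value of C in the following row. The last row has no successor, so
# (assumption) it is paired with its own value and stays unchanged.
next_c = np.append(c[1:], c[-1])

# Where B == 'x' take the next row's C, otherwise keep the current C.
result_numpy = np.where(b == 'x', next_c, c)

# Build a new frame rather than modifying df in place.
df_numpy = df.assign(C=result_numpy)
```

The whole column is processed in a few array operations, so the cost of the Python-level loop disappears regardless of how many rows are flagged.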
You can set a single cell of a pandas dataframe with df.at[row_label, column_label] = value. The at accessor reads or writes one value for a row/column label pair, and for a single scalar it is generally the fastest label-based way to do so.
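A minimal sketch of the question's loop rewritten with at, for comparison. It assumes the default integer index (so label + 1 is the next row) and that no 'x' row is the last row; df_at is a name chosen here for illustration.

```python
import pandas as pd

df_at = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 2, 3],
    'B': ['a', 'a', 'a', 'x', 'b', 'b', 'b', 'x', 'c'],
    'C': [11, 22, 33, 44, 55, 66, 77, 88, 99],
})

# Iterate only over the labels of the flagged rows, not over whole rows.
for label in df_at.index[df_at['B'] == 'x']:
    # Assumption: default RangeIndex, so label + 1 addresses the next row,
    # and the flagged row is never the last one.
    df_at.at[label, 'C'] = df_at.at[label + 1, 'C']
```

This is still a Python loop, so it only helps when few rows are flagged; it avoids building a Series per row the way iterrows() does.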
By using apply with axis=1, we can run a function on every row of a dataframe. This solution still loops over the rows to get the job done, but apply is generally faster than iterrows().
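To illustrate, here is one way the reset might be written with apply and axis=1. The helper reset_c and the names df_apply, new_c and df_applied are hypothetical, and the sketch again assumes a default integer index.

```python
import pandas as pd

df_apply = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 2, 3],
    'B': ['a', 'a', 'a', 'x', 'b', 'b', 'b', 'x', 'c'],
    'C': [11, 22, 33, 44, 55, 66, 77, 88, 99],
})

def reset_c(row, frame):
    """Return the next row's C for flagged rows, else the row's own C."""
    # row.name is the row's index label; with the default integer index
    # (assumption) the next row is simply row.name + 1.
    if row['B'] == 'x' and row.name + 1 in frame.index:
        return frame.at[row.name + 1, 'C']
    return row['C']

# axis=1 calls reset_c once per row; args passes the frame for the lookup.
new_c = df_apply.apply(reset_c, axis=1, args=(df_apply,))
df_applied = df_apply.assign(C=new_c)
```

Because the function runs once per row in Python, this is a convenience more than an optimization for large frames.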
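iterrows() - used for iterating over the rows as (index, Series) pairs. iteritems() - used for iterating over the (key, value) pairs; for a dataframe these are (column label, Series) pairs. Note that iteritems() was deprecated in pandas 1.5 and removed in 2.0 in favor of items().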
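A small sketch of what the two iterators yield, using a two-row frame invented for illustration (small, row_pairs and column_pairs are not from the question). It uses items(), which is available across pandas versions.

```python
import pandas as pd

small = pd.DataFrame({'B': ['a', 'x'], 'C': [11, 44]})

# iterrows() yields one (index label, row Series) pair per row.
row_pairs = [(idx, row['B'], row['C']) for idx, row in small.iterrows()]

# items() yields one (column label, column Series) pair per column.
column_pairs = [(name, col.tolist()) for name, col in small.items()]
```

Here row_pairs holds one tuple per row and column_pairs one tuple per column, which shows that the two methods walk the frame along different axes.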
This should do what you want:
df.loc[df.B == 'x', 'C'] = df.C.shift(-1)
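Since the reset has to be done many times, the same shift-based idea could be wrapped in a small helper. This is only a sketch: reset_from_next_row and its parameters are hypothetical names, it assumes a non-empty frame, and it assumes a flagged last row should keep its own value rather than become NaN.

```python
import pandas as pd

def reset_from_next_row(frame, flag_col='B', value_col='C', flag='x'):
    """Return a copy where value_col takes the next row's value on flagged rows."""
    out = frame.copy()
    # shift(-1) moves every value up by one row. fill_value supplies the
    # last row's own value, so the column keeps its dtype and a flagged
    # final row is left unchanged (assumption, not stated in the question).
    next_values = out[value_col].shift(-1, fill_value=out[value_col].iloc[-1])
    mask = out[flag_col] == flag
    # A single .loc assignment avoids chained indexing.
    out.loc[mask, value_col] = next_values[mask]
    return out

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 2, 3],
    'B': ['a', 'a', 'a', 'x', 'b', 'b', 'b', 'x', 'c'],
    'C': [11, 22, 33, 44, 55, 66, 77, 88, 99],
})
result = reset_from_next_row(df)
```

Using a single .loc call instead of df.C[...] = ... sidesteps the chained-assignment warning, and returning a copy leaves the original frame available for repeated runs.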