so far my dataframe looks like this:
ID Area Stage
1 P X
2 Q X
3 P X
4 Q Y
I would like to replace the area 'Q' with 'P' for every row where the Stage is equal to 'X'.
So the result should look like:
ID Area Stage
1 P X
2 P X
3 P X
4 Q Y
I tried:
data.query('Stage in ["X"]')['Area']=data.query('Stage in ["X"]')['Area'].replace('Q','P')
It does not work. Help is appreciated! :)
You can extract a column of pandas DataFrame based on another value by using the DataFrame. query() method. The query() is used to query the columns of a DataFrame with a boolean expression.
you can use 2 boolean conditions and use loc
:
df.loc[df['Area'].eq("Q") & df['Stage'].eq('X'),'Area']='P'
print(df)
ID Area Stage
0 1 P X
1 2 P X
2 3 P X
3 4 Q Y
Or np.where
df['Area'] = np.where(df['Area'].eq("Q") & df['Stage'].eq('X'),'P',df['Area'])
Could you please try following.
import pandas as pd
import numpy as np
df['Area']=np.where(df['Stage']=='X','P',df['Area'])
You can use loc
to specify where you want to replace, and pass the replaced series to the assignment:
df.loc[df['Stage']=='X', 'Area'] = df['Area'].replace('Q','P')
Output:
ID Area Stage
0 1 P X
1 2 P X
2 3 P X
3 4 Q Y
Note : this not an answer proposing a new way to do, but a comparison of the execution time each needs
All the proposals in the answers are quite 'magic' doing the job in one line of code thanks to pandas/numpy, anyway to do the job is good but to do it quickly is better, so I wanted to compare the execution time of each.
Here my program, in the loops I modify the dataframe two times to let it unchanged from a turn to the next ( I am not a Python programmer as you so sorry in advance if the way to do is 'poor') :
import pandas as pd
import numpy as np
import time
df=pd.DataFrame({'ID' : [i for i in range(1,1000)],
'Area' : ['P' if (i & 1) else 'Q' for i in range(1,1000)],
'Stage' : [ 'X' if (i & 2) else 'Y' for i in range(1,1000)]})
t0=time.process_time()
for i in range(1,100):
df.loc[df['Stage']=='X', 'Area'] = df['Area'].replace('Q','q')
df.loc[df['Stage']=='X', 'Area'] = df['Area'].replace('q','Q')
print("Quang Hoang", '%.2f' % (time.process_time() - t0))
t0=time.process_time()
for i in range(1,100):
df.loc[df['Stage'] == 'X', 'Area'] = 'q'
df.loc[df['Stage'] == 'X', 'Area'] = 'Q'
print("Joe Ferndz", '%.2f' % (time.process_time() - t0))
t0=time.process_time()
for i in range(1,100):
df.loc[df['Area'].eq("Q") & df['Stage'].eq('X'),'Area']='q'
df.loc[df['Area'].eq("q") & df['Stage'].eq('X'),'Area']='Q'
print("anky 1", '%.2f' % (time.process_time() - t0))
t0=time.process_time()
for i in range(1,100):
df['Area'] = np.where(df['Area'].eq("Q") & df['Stage'].eq('X'),'q',df['Area'])
df['Area'] = np.where(df['Area'].eq("q") & df['Stage'].eq('X'),'Q',df['Area'])
print("anky 2", '%.2f' % (time.process_time() - t0))
t0=time.process_time()
for i in range(1,100):
df['Area']=np.where(df['Stage']=='X','q',df['Area'])
df['Area']=np.where(df['Stage']=='X','Q',df['Area'])
print("RavinderSingh13", '%.2f' % (time.process_time() - t0))
On my PI 4 the result is :
Quang Hoang 1.60
Joe Ferndz 1.12
anky 1 1.55
anky 2 0.86
RavinderSingh13 0.38
if I use a dataframe having 100000 lines rather than 1000 the result is :
Quang Hoang 10.79
Joe Ferndz 6.61
anky 1 10.91
anky 2 9.64
RavinderSingh13 4.75
Note the proposals of Joe Ferndz and RavinderSingh13 suppose Area is only 'P' or 'Q'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With