I have a Pandas DataFrame like this:
       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3   0     0.4   0.4
    4   0     0     0.3
    5   0     0     0
    6   0.1   0.4   0.4
I want to replace the col1 values with the values in the second column (col2) only where col1 is equal to 0, and then, for the zero values that remain, do the same with the third column (col3). The desired result is:
       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3   0.4   0.4   0.4
    4   0.3   0     0.3
    5   0     0     0
    6   0.1   0.4   0.4
I did it using the pd.replace function, but it seems too slow. I think there must be a faster way to accomplish this.
    df.col1.replace(0,df.col2,inplace=True)
    df.col1.replace(0,df.col3,inplace=True)
Is there a faster way to do that, using some other function instead of pd.replace?
To replace a value in a pandas DataFrame, use the replace() method, passing the value to find and the value to substitute. replace() can also swap several values at once (for example, by passing a dict that maps old values to new ones). Called on a whole DataFrame, it searches every column and replaces each matching value; called on a single column, it affects only that column.
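As a small illustration of the replace() behaviour described above, here is a sketch on a made-up two-column frame (the column names `a` and `b` and the replacement values are only assumptions for the example):

```python
import pandas as pd

# Hypothetical toy data, not the DataFrame from the question.
toy = pd.DataFrame({"a": [0, 1, 2], "b": [2, 0, 1]})

# Scalar form on the whole DataFrame: every 0 in every column becomes 9.
everywhere = toy.replace(0, 9)

# Dict form on a single column: several old -> new pairs at once.
one_column = toy["a"].replace({0: 9, 1: 8})

print(everywhere)
print(one_column.tolist())
```

Both calls return new objects, so `toy` itself is left unchanged.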
Using np.where is faster. Following the same pattern you used with replace:
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
However, using a nested np.where is slightly faster:
    df['col1'] = np.where(df['col1'] == 0,
                          np.where(df['col2'] == 0, df['col3'], df['col2']),
                          df['col1'])
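To try the nested np.where on the example data, a self-contained sketch could look like the following. The DataFrame is rebuilt by hand from the table in the question, and the intermediate name `fallback` is just an illustrative choice:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Inner where: prefer col2, but fall back to col3 where col2 is 0.
fallback = np.where(df["col2"] == 0, df["col3"], df["col2"])

# Outer where: only touch the rows where col1 is 0.
df["col1"] = np.where(df["col1"] == 0, fallback, df["col1"])

print(df)
```

Rows 3 and 4 pick up 0.4 and 0.3 respectively, while row 5 stays 0 because all three columns are 0 there.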
Timings
Using the following setup to produce a larger sample DataFrame and timing functions:
    df = pd.concat([df]*10**4, ignore_index=True)

    def root_nested(df):
        df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
        return df

    def root_split(df):
        df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
        df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
        return df

    def pir2(df):
        df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
        return df

    def pir2_2(df):
        slc = (df.values != 0).argmax(axis=1)
        return df.values[np.arange(slc.shape[0]), slc]

    def andrew(df):
        df.col1[df.col1 == 0] = df.col2
        df.col1[df.col1 == 0] = df.col3
        return df

    def pablo(df):
        df['col1'] = df['col1'].replace(0,df['col2'])
        df['col1'] = df['col1'].replace(0,df['col3'])
        return df
I get the following timings:
    %timeit root_nested(df.copy())
    100 loops, best of 3: 2.25 ms per loop

    %timeit root_split(df.copy())
    100 loops, best of 3: 2.62 ms per loop

    %timeit pir2(df.copy())
    100 loops, best of 3: 6.25 ms per loop

    %timeit pir2_2(df.copy())
    1 loop, best of 3: 2.4 ms per loop

    %timeit andrew(df.copy())
    100 loops, best of 3: 8.55 ms per loop
I tried timing your method (pablo above), but it's been running for multiple minutes without completing. As a comparison, timing your method on just the 6-row example DataFrame (not the much larger one tested above) took 12.8 ms.
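If you are not in IPython and don't have %timeit, a rough equivalent can be put together with the standard-library timeit module. This is only a sketch: the helper `best_ms` and the shortened function names are made up for the example, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
# Same enlargement as in the setup above: 6 rows repeated 10**4 times.
big = pd.concat([small] * 10**4, ignore_index=True)


def nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


def best_ms(func, frame, repeat=3, number=10):
    """Best average time per call in milliseconds (hypothetical helper)."""
    # Each call gets a fresh copy, so the copy cost is included in the timing,
    # just as with %timeit func(df.copy()).
    times = timeit.repeat(lambda: func(frame.copy()), repeat=repeat, number=number)
    return min(times) / number * 1000


results = {f.__name__: best_ms(f, big) for f in (nested, split)}
for name, ms in results.items():
    print(f"{name}: {ms:.2f} ms per call")
```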
I'm not sure if it's faster, but you're right that you can slice the dataframe to get your desired result.
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    print(df)
Output:
       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4
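One caveat: chained indexing of the form df.col1[mask] = ... can trigger a SettingWithCopyWarning, and may not update df at all when pandas' Copy-on-Write mode is enabled. A sketch of the same two steps written with .loc, which assigns directly on the DataFrame, might look like this (the sample data is rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Step 1: where col1 is 0, take the value from col2 (aligned on the index).
df.loc[df["col1"] == 0, "col1"] = df["col2"]

# Step 2: for the zeros that remain, take the value from col3.
df.loc[df["col1"] == 0, "col1"] = df["col3"]

print(df)
```

The result should match the output shown above; I haven't timed this variant.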
Alternatively, if you want it to be more terse (though I don't know if it's faster), you can combine what you did with what I did.
    df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
    print(df)
Output:
       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4
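Another compact way to write the same idea, without replace or chained indexing, is Series.mask, which swaps in the second argument wherever the condition is True. This is just a sketch of an alternative (the name `fallback` is illustrative), and I haven't benchmarked it against the versions above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# col2 with its zeros filled from col3.
fallback = df["col2"].mask(df["col2"] == 0, df["col3"])

# col1 with its zeros filled from that fallback.
df["col1"] = df["col1"].mask(df["col1"] == 0, fallback)

print(df)
```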