I have a pandas DataFrame and I would like to add a new column based on the values of the other columns. A minimal example illustrating my use case is below.
import random
import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df
   a  b   c
0  4  5  19
1  1  2   0
2  2  5   9
3  8  2   5
x = df.sample(n=2)
x
   a  b  c
3  8  2  5
1  1  2  0
def get_new(row):
    a, b, c = row
    # pick a random 'c' from rows with a different 'a', the same 'b', and a different 'c'
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)
y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x
   a  b  c  new
3  8  2  5    0
1  1  2  0    5
Note: The original dataframe has ~4 million rows and ~6 columns. The number of rows in the sample might vary between 50 and 500. I am running on a 64-bit machine with 8 GB RAM.
The above works, except that it is quite slow (it takes about 15 seconds for me). I also tried using x.itertuples() instead of apply, and there was not much of an improvement in this case.
It seems that apply (with axis=1) is slow since it does not make use of vectorized operations. Is there some way I could achieve this faster? Can the filtering (in the get_new function) be modified or made more efficient than the chained boolean conditions I currently use? Can I use numpy here in some way for a speedup?
Edit: df.sample() is also quite slow, and I cannot use .iloc or .loc since I am further modifying the sample and do not wish for this to affect the original DataFrame.
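For concreteness, a minimal sketch of the decoupling I am after, using .copy() to make the sample explicitly independent of the original (the extra column here is just a placeholder):

import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])

x = df.sample(n=2).copy()  # an explicitly independent copy of the sampled rows
x['new'] = 0               # placeholder assignment; df itself is left untouched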
You can replace the values of all or selected columns of a pandas DataFrame based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or by a boolean array, and it can both read and modify the values of the DataFrame.
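A minimal sketch of that idea (the column names and condition here are placeholders, not taken from the question):

import pandas as pd

df = pd.DataFrame({'a': [4, 1, 2, 8], 'b': [5, 2, 5, 2]})

# Boolean-array indexing via .loc: select the rows where 'a' > 3
# and overwrite their 'b' values in place.
df.loc[df['a'] > 3, 'b'] = 0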
We can also write a function that operates on the column values of each row, for example to subtract one column from another, and use the apply method to run it across all rows.
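A hedged sketch of that pattern, assuming two numeric columns 'a' and 'b':

import pandas as pd

df = pd.DataFrame({'a': [4, 1, 2, 8], 'b': [5, 2, 5, 2]})

def subtract(row):
    # operate on one row's column values at a time
    return row['a'] - row['b']

# axis=1 hands each row to the function; the results form a new column
df['diff'] = df.apply(subtract, axis=1)

Note that for a simple subtraction the vectorized form df['a'] - df['b'] is much faster than apply, which is exactly the performance issue the question runs into.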
I see a reasonable performance improvement by using .loc rather than chained indexing:
import random, pandas as pd, numpy as np
df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df = pd.concat([df] * 1000000)  # scale up to ~4 million rows, as in the question
x = df.sample(n=2)

def get_new(row):
    a, b, c = row
    # chained indexing: boolean mask first, then column selection on the result
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

def get_new2(row):
    a, b, c = row
    # a single .loc call applies the mask and selects the column together
    return random.choice(df.loc[(df['a'] != a) & (df['b'] == b) & (df['c'] != c), 'c'].values)
%timeit x.apply(lambda row: get_new(row), axis=1) # 159ms
%timeit x.apply(lambda row: get_new2(row), axis=1) # 119ms
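Going further on the numpy part of the question: because the 'b' condition is an equality test, the (a, c) pairs can be pre-grouped by 'b' once, so each lookup only scans its own group instead of all ~4 million rows. A minimal sketch (the groups dict and get_new_np helper are my own illustration, not part of the code above), assuming as in the original that at least one candidate row always exists:

import numpy as np

# Build the lookup once: for every distinct 'b', the (a, c) pairs sharing it.
groups = {key: grp[['a', 'c']].values for key, grp in df.groupby('b')}

def get_new_np(row):
    a, b, c = row
    arr = groups[b]                                    # only rows with this 'b'
    candidates = arr[(arr[:, 0] != a) & (arr[:, 1] != c), 1]
    return np.random.choice(candidates)

y = x.apply(get_new_np, axis=1)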