I have a pandas DataFrame and I would like to add a new column based on the values of the other columns. A minimal example illustrating my use case is below.
import random
import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df
   a  b   c
0  4  5  19
1  1  2   0
2  2  5   9
3  8  2   5
x = df.sample(n=2)
x
   a  b  c
3  8  2  5
1  1  2  0
def get_new(row):
    a, b, c = row
    # pick a random 'c' from rows with a different 'a', the same 'b', and a different 'c'
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)
y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x
   a  b  c  new
3  8  2  5    0
1  1  2  0    5
Note: The original dataframe has ~4 million rows and ~6 columns. The number of rows in the sample might vary between 50 and 500. I am running on a 64-bit machine with 8 GB RAM.
The above works, except that it is quite slow (it takes about 15 seconds for me). I also tried using x.itertuples() instead of apply, and there was not much of an improvement in this case.
It seems that apply (with axis=1) is slow since it does not make use of vectorized operations. Is there some way I could achieve this faster? Can the filtering (in the get_new function) be modified or made more efficient than the chained boolean conditions I currently use? Can I use numpy here in some way for a speedup?
Edit: df.sample() is also quite slow, and I cannot use .iloc or .loc since I am further modifying the sample and do not wish for this to affect the original DataFrame.
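For concreteness, a minimal sketch of the decoupling I am after, using .copy() to make the sample explicitly independent of the original (the extra column here is just a placeholder):

import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])

x = df.sample(n=2).copy()  # an explicitly independent copy of the sampled rows
x['new'] = 0               # placeholder assignment; df itself is left untouched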
You can replace the values of all or selected columns of a pandas DataFrame based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or by a boolean array, and it can both read and modify the values of the DataFrame.
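A minimal sketch of that idea (the column names and condition here are placeholders, not taken from the question):

import pandas as pd

df = pd.DataFrame({'a': [4, 1, 2, 8], 'b': [5, 2, 5, 2]})

# Boolean-array indexing via .loc: select the rows where 'a' > 3
# and overwrite their 'b' values in place.
df.loc[df['a'] > 3, 'b'] = 0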
We can also write a function that operates on the column values of each row, for example to subtract one column from another, and use the apply method to run it across all rows.
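A hedged sketch of that pattern, assuming two numeric columns 'a' and 'b':

import pandas as pd

df = pd.DataFrame({'a': [4, 1, 2, 8], 'b': [5, 2, 5, 2]})

def subtract(row):
    # operate on one row's column values at a time
    return row['a'] - row['b']

# axis=1 hands each row to the function; the results form a new column
df['diff'] = df.apply(subtract, axis=1)

Note that for a simple subtraction the vectorized form df['a'] - df['b'] is much faster than apply, which is exactly the performance issue the question runs into.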
I see a reasonable performance improvement by using .loc rather than chained indexing:
import random, pandas as pd, numpy as np
df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df = pd.concat([df] * 1000000)  # scale up to ~4 million rows, as in the question
x = df.sample(n=2)

def get_new(row):
    a, b, c = row
    # chained indexing: boolean mask first, then column selection on the result
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

def get_new2(row):
    a, b, c = row
    # a single .loc call applies the mask and selects the column together
    return random.choice(df.loc[(df['a'] != a) & (df['b'] == b) & (df['c'] != c), 'c'].values)
%timeit x.apply(lambda row: get_new(row), axis=1) # 159ms
%timeit x.apply(lambda row: get_new2(row), axis=1) # 119ms
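Going further on the numpy part of the question: because the 'b' condition is an equality test, the (a, c) pairs can be pre-grouped by 'b' once, so each lookup only scans its own group instead of all ~4 million rows. A minimal sketch (the groups dict and get_new_np helper are my own illustration, not part of the code above), assuming as in the original that at least one candidate row always exists:

import numpy as np

# Build the lookup once: for every distinct 'b', the (a, c) pairs sharing it.
groups = {key: grp[['a', 'c']].values for key, grp in df.groupby('b')}

def get_new_np(row):
    a, b, c = row
    arr = groups[b]                                    # only rows with this 'b'
    candidates = arr[(arr[:, 0] != a) & (arr[:, 1] != c), 1]
    return np.random.choice(candidates)

y = x.apply(get_new_np, axis=1)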