Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas alternative to apply - to create new column based on multiple columns

I have a Pandas dataframe and I would like to add a new column based on the values of the other columns. A minimal example illustrating my usecase is below.

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df

    a   b   c
---------------
0   4   5   19
1   1   2   0
2   2   5   9
3   8   2   5

x = df.sample(n=2)
x

    a   b   c
---------------
3   8   2   5
1   1   2   0

def get_new(row):
    a, b, c = row
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x

    a   b   c   new
--------------------
3   8   2   5   0
1   1   2   0   5

Note: The original dataframe has ~4 million rows and ~6 columns. The number of rows in the sample might vary between 50 and 500. I am running on a 64-bit machine with 8 GB RAM.

The above works, except that it is quite slow (takes about 15 seconds for me). I also tried using x.itertuples() instead of apply and there is not much of an improvement in this case.

  1. It seems that apply(with axis=1) is slow since it does not make use of the vectorized operations. Is there some way I could achieve this in a faster way?

  2. Can the filtering(in the get_new function) be modified or made more efficient compared to using conditional boolean variables, as I currently have?

  3. Can I in some way use numpy here for some speedup?

Edit: df.sample() is also quite slow and I cannot use .iloc or .loc since I am further modifying the sample and do not wish for this to affect the original dataframe.

like image 814
swathis Avatar asked Mar 01 '18 15:03

swathis


People also ask

How replace column values in pandas based on multiple conditions?

You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.

How do you create a new column in pandas by subtracting two columns?

We can create a function specifically for subtracting the columns, by taking column data as arguments and then using the apply method to apply it to all the data points throughout the column.


1 Answers

I see a reasonable performance improvement by using .loc rather than chained indexing:

import random, pandas as pd, numpy as np

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])

df = pd.concat([df]*1000000)

x = df.sample(n=2)

def get_new(row):
    a, b, c = row
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

def get_new2(row):
    a, b, c = row
    return random.choice(df.loc[(df['a'] != a) & (df['b'] == b) & (df['c'] != c), 'c'].values)


%timeit x.apply(lambda row: get_new(row), axis=1)   # 159ms
%timeit x.apply(lambda row: get_new2(row), axis=1)  # 119ms
like image 125
jpp Avatar answered Sep 16 '22 15:09

jpp