
Most pythonic way to concatenate pandas cells with conditions

I have the following Pandas DataFrame, with city and arr columns:

city      arr  final_target
paris     11   paris_11
paris     12   paris_12
dallas    22   dallas
miami     15   miami
paris     16   paris_16

My goal is to fill the final_target column by concatenating the city name and the arr number when the city is paris, and with just the city name otherwise.

What is the most pythonic way to do this?

Alex Dana asked Dec 17 '22 11:12



1 Answer

What is the most pythonic way to do this?

It depends on the definition. If "pythonic" means the preferred, most common and fastest way, then the np.where solution is the most pythonic one here.


Use numpy.where. If you prefer a pure pandas approach, the alternatives further down also work. All of these solutions are vectorized, so they should be preferable to apply (which loops under the hood):

import numpy as np

df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])

Pandas alternatives:

df['final_target'] = df['city'].mask(df['city'].eq('paris'), 
                                     df['city'] + '_' + df['arr'].astype(str))

df['final_target'] = df['city'].where(df['city'].ne('paris'), 
                                      df['city'] + '_' + df['arr'].astype(str))
print(df)
     city  arr final_target
0   paris   11     paris_11
1   paris   12     paris_12
2  dallas   22       dallas
3   miami   15        miami
4   paris   16     paris_16
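
For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the sample frame from the question and checks that the three variants agree. The variable names (is_paris, joined, via_np_where, via_mask, via_where) are my own and only for illustration; wrapping the np.where result in a Series is just a convenience for comparing the results.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})

# Condition and concatenated value, computed once and reused (illustrative names)
is_paris = df["city"].eq("paris")
joined = df["city"] + "_" + df["arr"].astype(str)

# np.where returns a NumPy array, so wrap it in a Series for comparison
via_np_where = pd.Series(np.where(is_paris, joined, df["city"]), index=df.index)

# mask replaces values where the condition is True
via_mask = df["city"].mask(is_paris, joined)

# where keeps values where the condition is True, hence the negation
via_where = df["city"].where(~is_paris, joined)

print(via_mask.tolist())
# ['paris_11', 'paris_12', 'dallas', 'miami', 'paris_16']
```

Note that mask and where are mirror images of each other: the same replacement values are used, only the condition is inverted.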

Performance:

#50k rows
df = pd.concat([df] * 10000, ignore_index=True)
    

In [157]: %%timeit
     ...: df['final_target'] = np.where(df['city'].eq('paris'), 
     ...:                               df['city'] + '_' + df['arr'].astype(str), 
     ...:                               df['city'])
     ...:                               
48.6 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [158]: %%timeit
     ...: df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
     ...: 
     ...: 
49.2 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [159]: %%timeit
     ...: df['final_target'] = df['city']
     ...: df.loc[df['city'] == 'paris', 'final_target'] +=  '_' + df['arr'].astype(str)
     ...: 
63.8 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [160]: %%timeit
     ...: df['final_target'] = df.apply(lambda x: x.city + '_' + str(x.arr) if x.city == 'paris' else x.city, axis = 1)
     ...: 
     ...: 
1.33 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
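
The timings above come from IPython's %%timeit. If you want to repeat the comparison in a plain script, a rough sketch with the standard timeit module could look like this. It assumes a smaller frame (5k rows instead of 50k) so that it finishes quickly, the helper names with_np_where and with_apply are hypothetical, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data from the question, repeated to 5k rows (smaller than the 50k used above)
small = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})
df = pd.concat([small] * 1000, ignore_index=True)


def with_np_where(frame):
    # Vectorized variant: condition and concatenation are computed on whole columns
    return np.where(frame["city"].eq("paris"),
                    frame["city"] + "_" + frame["arr"].astype(str),
                    frame["city"])


def with_apply(frame):
    # Row-wise variant: the lambda is called once per row in Python
    return frame.apply(
        lambda row: row["city"] + "_" + str(row["arr"])
        if row["city"] == "paris" else row["city"],
        axis=1,
    )


# Best of 3 repeats, 3 calls each, reported as seconds per call
t_where = min(timeit.repeat(lambda: with_np_where(df), number=3, repeat=3)) / 3
t_apply = min(timeit.repeat(lambda: with_apply(df), number=3, repeat=3)) / 3

print(f"np.where: {t_where * 1000:.2f} ms per call")
print(f"apply:    {t_apply * 1000:.2f} ms per call")
```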
jezrael answered Feb 15 '23 23:02
