Pandas apply in parallel when axis=0

I want to apply a function to every column of a pandas DataFrame in parallel. For example, I want to parallelize this:

import pandas as pd


def my_sum(x, a):
    return x + a


df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]})
df.apply(lambda x: my_sum(x, 2), axis=0)
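
To make the unit of work explicit: with axis=0, apply hands each column to the function as a Series, so the call above should be equivalent to one independent call per column. A minimal sketch of that equivalence (the names via_apply and via_loop are only for illustration):

```python
import pandas as pd


def my_sum(x, a):
    return x + a


df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]})

# axis=0 passes each column to the function as a Series
via_apply = df.apply(lambda x: my_sum(x, 2), axis=0)

# Equivalent serial loop: one independent call per column
via_loop = pd.DataFrame({name: my_sum(col, 2) for name, col in df.items()})

print(via_apply.equals(via_loop))
```

Since the per-column calls don't depend on each other, they are the natural pieces to hand out to parallel workers.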

I know about the swifter package, but its apply doesn't support axis=0:

NotImplementedError: Swifter cannot perform axis=0 applies on large datasets. Dask currently does not have an axis=0 apply implemented. More details at https://github.com/jmcarpenter2/swifter/issues/10

According to the swifter documentation, Dask doesn't support axis=0 applies either.

I have googled several sources but couldn't find an easy solution.

Can't believe this is so complicated in pandas.

Mislav asked Mar 19 '20 14:03


1 Answer

Koalas provides a way to run DataFrame computations in parallel. It exposes the same API as pandas but executes the operations on an Apache Spark engine in the background.

Note that you need the parallel infrastructure available in order to use it properly.

In their blog post they compare the following two snippets:

pandas:

import pandas as pd
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
# Rename columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place
df['x2'] = df.x * df.x

Koalas:

import databricks.koalas as ks
df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
# Rename columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place
df['x2'] = df.x * df.x
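
If no Spark infrastructure is at hand, the column-wise apply from the question can also be approximated with the standard library alone. This is not Koalas, just a rough sketch using concurrent.futures; parallel_apply_columns is a hypothetical helper name, and it assumes the function returns a Series for each column. Threads only pay off when the function spends its time in code that releases the GIL; for CPU-bound pure-Python functions, ProcessPoolExecutor would be the likely swap, at the cost of needing a picklable top-level function and an `if __name__ == "__main__":` guard.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import pandas as pd


def my_sum(x, a):
    return x + a


def parallel_apply_columns(df, func, max_workers=None):
    """Hypothetical helper: call func on every column using a thread pool.

    Assumes func returns a Series for each column it receives.
    """
    columns = [col for _, col in df.items()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with df.columns
        results = list(pool.map(func, columns))
    # Stitch the per-column results back together under the original names
    return pd.concat(results, axis=1, keys=df.columns)


df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]})
result = parallel_apply_columns(df, partial(my_sum, a=2))
print(result)
```

For a function as cheap as my_sum, the pool overhead will most likely outweigh any gain; a sketch like this only makes sense when each column takes noticeable time to process.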
jorijnsmit answered Oct 19 '22 06:10