I want to apply a function to every column of a pandas DataFrame in parallel. For example, I want to parallelize this:
import pandas as pd

def my_sum(x, a):
    return x + a

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]})
df.apply(lambda x: my_sum(x, 2), axis=0)
I know there is a swifter package, but it doesn't support axis=0 in apply:
NotImplementedError: Swifter cannot perform axis=0 applies on large datasets. Dask currently does not have an axis=0 apply implemented. More details at https://github.com/jmcarpenter2/swifter/issues/10
Dask also doesn't support this for axis=0 (according to the swifter documentation).
I have googled several sources but couldn't find an easy solution.
Can't believe this is so complicated in pandas.
Koalas provides a way to perform computations on a dataframe in parallel. It accepts the same commands as pandas but runs them on an Apache Spark engine in the background.
Note that you need the parallel infrastructure (a Spark installation or cluster) available in order to use it properly.
In their blog post, they compare the following chunks of code:
pandas:
import pandas as pd
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
# Rename columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place
df['x2'] = df.x * df.x
Koalas:
import databricks.koalas as ks
df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
# Rename columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place
df['x2'] = df.x * df.x
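If setting up Spark is more than the problem needs, a per-column apply can also be sketched with the standard library alone. The following is a minimal sketch, not a Koalas or swifter feature: the helper name parallel_column_apply is hypothetical, and it assumes the function returns a Series for each column. It uses threads, which only help when the function spends its time in code that releases the GIL (such as vectorized NumPy operations) or in I/O.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import pandas as pd


def my_sum(x, a):
    return x + a


def parallel_column_apply(df, func, max_workers=None):
    """Hypothetical helper: apply func to each column concurrently.

    Assumes func takes a Series and returns a Series with the same index.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so they match df.columns.
        results = list(pool.map(func, (df[col] for col in df.columns)))
    # Reassemble the per-column results into one DataFrame.
    return pd.concat(results, axis=1, keys=df.columns)


df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]})
result = parallel_column_apply(df, partial(my_sum, a=2))
```

For CPU-bound pure-Python functions, swapping in ProcessPoolExecutor may work better, provided the function is picklable (defined at module level rather than as a lambda) and the script is guarded with if __name__ == "__main__". For something as cheap as x + a, the plain vectorized df + 2 will almost certainly beat any parallel version.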