How can I apply a function to calculate the logarithm of a single column of a large dataset using Dask? My current attempt is:
df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name': 'float32'}).compute()
The dataset is very large (125 million rows).
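For context, a minimal sketch of the setup this assumes (a small synthetic frame stands in for the real 125M-row file; 'column_name' matches the meta key in the attempt above):

import dask.dataframe as dd
import numpy as np
import pandas as pd

# Stand-in for the real dataset: a small synthetic frame split into
# partitions the same way a large file read with dd.read_csv would be.
df_train = dd.from_pandas(
    pd.DataFrame({'column_name': np.random.rand(10_000)}),
    npartitions=8,
)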
For reference, an equivalent pandas query took 182 seconds while the optimized Dask query took 19 seconds, roughly 10 times faster. Dask can provide performance boosts over pandas because it executes common operations in parallel, while pandas is limited to a single core.
You have a few options:
Just as your pandas dataframe can use NumPy functions:
import numpy as np
result = np.log1p(df.x)
Dask dataframes can use dask.array functions:
import dask.array as da
result = da.log1p(df.x)
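Putting the two options together in a minimal runnable sketch (the dataframe construction and the column name x are assumptions for illustration):

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# Small stand-in dask dataframe; in practice it would come from
# dd.read_csv or dd.read_parquet on the full dataset.
df = dd.from_pandas(pd.DataFrame({'x': np.random.rand(1_000)}), npartitions=4)

# Option 1: NumPy ufuncs dispatch to the dask Series and stay lazy.
result_np = np.log1p(df.x)

# Option 2: the matching dask.array function.
result_da = da.log1p(df.x)

# Nothing runs until .compute(); then partitions are processed in parallel.
print(result_np.compute().head())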
But maybe no such dask.array function exists for your particular function. You can always use map_partitions to apply any function you would normally run on a pandas dataframe across all of the pandas dataframes that make up your dask dataframe:
Pandas:
result = f(df.x)
Dask DataFrame:
result = df.x.map_partitions(f)
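A concrete sketch of the map_partitions route (f, the column name x, and the meta tuple are illustrative assumptions; meta tells Dask the output name and dtype up front):

import numpy as np
import pandas as pd
import dask.dataframe as dd

def f(s: pd.Series) -> pd.Series:
    # Any ordinary pandas-level function; log1p is the running example.
    return np.log1p(s)

df = dd.from_pandas(pd.DataFrame({'x': np.random.rand(1_000)}), npartitions=4)

# f is called once per partition, each partition being a plain pandas Series.
result = df.x.map_partitions(f, meta=('x', 'float64'))
print(result.compute().head())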
You can also use the map or apply(axis=0) methods, but just like in Pandas these are usually very bad for performance.
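For comparison, a sketch of the element-wise route (the meta tuple is again an illustrative assumption; this calls the Python function once per value, which is why it is usually much slower):

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'x': np.random.rand(1_000)}), npartitions=4)

# Element-wise apply: one Python call per value, so typically far slower
# than the vectorized np.log1p / da.log1p / map_partitions options above.
result = df.x.apply(np.log1p, meta=('x', 'float64'))
print(result.compute().head())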