 

How to apply a function to a single column of a large dataset using Dask?

How can I apply a function that calculates the logarithm of a single column of a large dataset using Dask?

df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name': 'float32'}).compute()

The dataset is very large (125 million rows), so how can I do this efficiently?

asked Mar 09 '18 17:03 by ambigus9

People also ask

Is Dask faster than pandas?

The original pandas query took 182 seconds and the optimized Dask query took 19 seconds, which is about 10 times faster. Dask can provide performance boosts over pandas because it can execute common operations in parallel, where pandas is limited to a single core.


1 Answer

You have a few options:

Use dask.array functions

Just as a pandas dataframe can use numpy functions:

import numpy as np
result = np.log1p(df.x)

Dask dataframes can use dask array functions:

import dask.array as da
result = da.log1p(df.x)

Map Partitions

But maybe no such dask.array function exists for your particular function. You can always use map_partitions to apply any function that you would normally run on a pandas dataframe across all of the pandas dataframes that make up your dask dataframe:

Pandas

result = f(df.x)

Dask DataFrame

result = df.x.map_partitions(f)

Map

You can always use the map or apply(axis=0) methods, but just like in pandas these are usually very bad for performance.

answered Oct 18 '22 21:10 by MRocklin