 

How to apply a function to a single column of a large dataset using Dask?

How can I apply a function that calculates the logarithm of a single column of a large dataset using Dask?

df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name': 'float32'}).compute()

The dataset is very large (125 million rows), so how can I do this efficiently?

asked Mar 09 '18 17:03 by ambigus9

People also ask

Is Dask faster than pandas?

The original pandas query took 182 seconds and the optimized Dask query took 19 seconds, which is about 10 times faster. Dask can provide performance boosts over pandas because it can execute common operations in parallel, where pandas is limited to a single core.


1 Answer

You have a few options:

Use dask.array functions

Just as a pandas dataframe can use numpy functions:

import numpy as np
result = np.log1p(df.x)

Dask dataframes can use dask array functions:

import dask.array as da
result = da.log1p(df.x)

Map Partitions

But maybe no such dask.array function exists for your particular function. You can always use map_partitions to apply any function that you would normally run on a pandas dataframe across all of the pandas dataframes that make up your dask dataframe:

Pandas

result = f(df.x)

Dask DataFrame

result = df.x.map_partitions(f)

Map

You can always use the map or apply(axis=0) methods, but just like in pandas these are usually very bad for performance.

answered Oct 18 '22 21:10 by MRocklin