Strategy for partitioning dask dataframes efficiently

Tags: python, optimization, dataframe, dask

The Dask documentation talks about repartitioning to reduce overhead.

However, it seems to indicate that you need some knowledge of what your dataframe will look like beforehand (i.e. that there will be 1/100th of the data expected).

Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force to True to expand partitions if need be. This one-size-fits-all approach works but is definitely suboptimal, as my dataset varies in size.

The data is time series data, but unfortunately not at regular intervals. I've used repartition by time frequency in the past, but this would be suboptimal because of how irregular the data is (sometimes nothing for minutes, then thousands of records in seconds).

683

asked Jun 20 '17 15:06

Samantha Hughes

1 Answer

As of Dask 2.0.0 you may call .repartition(partition_size="100MB").

This method sizes partitions by their measured memory footprint (.memory_usage(deep=True)), so object columns such as strings are counted by their actual contents. It will join smaller partitions, or split partitions that have grown too large.

Dask's documentation also outlines the usage.

162

answered Oct 02 '22 13:10

Wes Roach
