Strategy for partitioning dask dataframes efficiently

The documentation for Dask talks about repartitioning to reduce overhead here.

However, they seem to indicate that you need some knowledge of what your dataframe will look like beforehand (i.e. that there will be 1/100th of the data expected).

Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number and set force to True to expand partitions if need be. This one-size-fits-all approach works, but it is definitely suboptimal as my dataset varies in size.
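
For context, a minimal sketch of what that currently looks like (magic_number and the CSV path are hypothetical placeholders, not anything from Dask):

    import multiprocessing

    import dask.dataframe as dd

    ncores = multiprocessing.cpu_count()
    magic_number = 4  # hypothetical fudge factor

    # Hypothetical source; any dask DataFrame works the same way.
    ddf = dd.read_csv("timeseries-*.csv")

    # One partition count regardless of how big the dataset actually is.
    ddf = ddf.repartition(npartitions=ncores * magic_number, force=True)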

The data is time-series data, but unfortunately not at regular intervals. I've used repartitioning by time frequency in the past, but this would be suboptimal because of how irregular the data is (sometimes nothing for minutes, then thousands of records within seconds).
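
For reference, a minimal sketch of that frequency-based approach, assuming the data has a datetime index (the timestamps and frequency here are hypothetical):

    import dask.dataframe as dd
    import pandas as pd

    # Hypothetical irregular time-series data with a datetime index.
    pdf = pd.DataFrame(
        {"value": range(6)},
        index=pd.to_datetime([
            "2017-06-20 09:00:00", "2017-06-20 09:00:01", "2017-06-20 09:00:02",
            "2017-06-20 09:14:00", "2017-06-20 09:14:00.5", "2017-06-20 09:30:00",
        ]),
    )
    ddf = dd.from_pandas(pdf, npartitions=1)

    # Partition by wall-clock frequency; bursts and gaps make the resulting
    # partitions very uneven in size, which is the problem described above.
    ddf = ddf.repartition(freq="15min")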

Samantha Hughes asked Jun 20 '17 15:06


1 Answer

As of Dask 2.0.0, you may call .repartition(partition_size="100MB").

This method measures partition size by actual object memory footprint (via .memory_usage(deep=True)). It will join partitions that are too small and split partitions that have grown too large.

Dask's documentation also outlines the usage.
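
A minimal sketch of that call, assuming an existing dask DataFrame ddf (the source file is hypothetical):

    import dask.dataframe as dd

    ddf = dd.read_csv("timeseries-*.csv")  # hypothetical source

    # Rebalance so each partition holds roughly 100 MB in memory, measured
    # via pandas' memory_usage(deep=True); small partitions are merged and
    # oversized partitions are split.
    ddf = ddf.repartition(partition_size="100MB")

Because this sizes partitions by memory footprint rather than by row count or time frequency, it adapts to datasets of different sizes without a magic number.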

Wes Roach answered Oct 02 '22 13:10