 

Repartition Dask DataFrame to get even partitions

I have a Dask DataFrame whose index (client_id) is not unique. Repartitioning and resetting the index ends up with very uneven partitions - some contain only a few rows, others hundreds of thousands. For instance, the following code:

for p in range(ddd.npartitions):
    print(len(ddd.get_partition(p)))

prints out something like this:

55
17
5
41
51
1144
4391
75153
138970
197105
409466
415925
486076
306377
543998
395974
530056
374293
237
12
104
52
28
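
Side note: each len(ddd.get_partition(p)) call in the loop above runs its own computation. A quicker sketch that gathers all partition lengths in a single pass:

# One-pass alternative (sketch): map each partition to its length, so a
# single compute() returns every partition size at once.
part_sizes = ddd.map_partitions(len).compute()
print(part_sizes.tolist())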

My DataFrame is one-hot encoded and has over 500 columns, so the larger partitions don't fit in memory. I would like to repartition the DataFrame into evenly sized partitions. Do you know an efficient way to do this?

EDIT 1

A simple reproduction:

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': np.arange(0, 10000), 'y': np.arange(0, 10000)})
df2 = pd.DataFrame({'x': np.append(np.arange(0, 4995), np.arange(5000, 10000, 1000)),
                    'y2': np.arange(0, 10000, 2)})
dd_df = dd.from_pandas(df, npartitions=10).set_index('x')
dd_df2 = dd.from_pandas(df2, npartitions=5).set_index('x')
new_ddf = dd_df.merge(dd_df2, how='right')
#new_ddf = new_ddf.reset_index().set_index('x')
#new_ddf = new_ddf.repartition(npartitions=2)
new_ddf.divisions
for p in range(new_ddf.npartitions):
    print(len(new_ddf.get_partition(p)))

Note the last partitions (a single element each):

1000
1000
1000
1000
995
1
1
1
1
1

Even when the commented-out lines are uncommented, the partitions remain uneven in size.

EDIT 2: Workaround

A simple workaround can be achieved with the following code. Is there a more elegant way to do this (more in the Dask way)?

def repartition(ddf, npartitions=None):
    """Repartition `ddf` on index-value boundaries so partitions stay under MAX_PART_SIZE bytes."""
    MAX_PART_SIZE = 100*1024

    if npartitions is None:
        npartitions = ddf.npartitions

    # Estimated size of a single row in bytes (sum of the column dtype sizes).
    one_row_size = sum(dt.itemsize for dt in ddf.dtypes)
    length = len(ddf)

    # Keep the requested partition count unless partitions would exceed MAX_PART_SIZE.
    requested_part_size = length/npartitions*one_row_size
    if requested_part_size <= MAX_PART_SIZE:
        n_parts = npartitions
    else:
        n_parts = length*one_row_size/MAX_PART_SIZE

    chunksize = int(length/n_parts)

    # Rows per index value; new divisions must fall on index-value boundaries.
    vc = ddf.index.value_counts().to_frame(name='count').compute().sort_index()

    # Greedily accumulate index values until a chunk reaches `chunksize` rows.
    vsum = 0
    divisions = [ddf.divisions[0]]
    for i, v in vc.iterrows():
        vsum += v['count']
        if vsum > chunksize:
            divisions.append(i)
            vsum = 0
    divisions.append(ddf.divisions[-1])

    return ddf.repartition(divisions=divisions, force=True)
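
A hypothetical call against the reproduction above (the npartitions value is only illustrative):

# Sketch: rebalance the merged frame from EDIT 1 and inspect the new sizes.
balanced = repartition(new_ddf, npartitions=4)
print(balanced.map_partitions(len).compute().tolist())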
asked Oct 04 '18 by Szymon



1 Answer

You're correct that .repartition won't do the trick since it doesn't handle any of the logic for computing divisions and just tries to combine the existing partitions wherever possible. Here's a solution I came up with for the same problem:

import numpy as np
import dask.dataframe as dd


def _rebalance_ddf(ddf):
    """Repartition dask dataframe to ensure that partitions are roughly equal size.

    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    # Per-partition counts of each index value, combined into one pandas Series.
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    # Reconstruct the sorted index locally from the counts.
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)
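
For example, applied to the merged frame from the question (a sketch, reusing new_ddf from EDIT 1):

balanced = _rebalance_ddf(new_ddf)
print(balanced.map_partitions(len).compute().tolist())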

The internal function sorted_division_locations does what you want already, but it only works on an actual list-like, not a lazy dask.dataframe.Index. The function above therefore avoids pulling the full index in case there are many duplicates, and instead just gets the counts and reconstructs the index locally from those.
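
To illustrate, sorted_division_locations splits a plain sorted sequence into partition boundaries (a rough sketch; the exact return values may vary between dask versions):

# Assumes dask.dataframe is imported as dd, as above. Given a sorted sequence
# with duplicates, the function returns candidate division values plus their
# positional offsets, without splitting a run of equal values across partitions.
divs, locs = dd.io.io.sorted_division_locations(
    ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'], npartitions=3)
# divs holds boundary values (e.g. something like ['A', 'B', 'C', 'C']) and
# locs the matching positions (e.g. [0, 4, 7, 8]); values are indicative only.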

If your dataframe is so large that even the index won't fit in memory then you'd need to do something even more clever.

answered Oct 10 '22 by bnaul