I have a Dask dataframe (df) with around 250 million rows (from a 10 GB CSV file). I have another pandas dataframe (ndf) of 25,000 rows. I would like to add the first column of the pandas dataframe to the Dask dataframe, repeating every item 10,000 times.
Here's the code that I tried. I have reduced the problem to a smaller size.
import dask.dataframe as dd
import pandas as pd
import numpy as np
pd.DataFrame(np.random.rand(25000, 2)).to_csv("tempfile.csv")
df = dd.read_csv("tempfile.csv")
ndf = pd.DataFrame(np.random.randint(1000, 3500, size=2500))
df['Node'] = np.repeat(ndf[0], 10)
With this code, I end up with an error.
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
I can perform a reset_index() followed by a set_index() to make df.known_divisions True for the Dask dataframe, but it is a time-consuming operation. Is there a better, faster way to do what I am trying to do? Can I do this using pandas itself?
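For reference, a minimal sketch of one way to end up with known divisions (this is an assumption about the workaround, not necessarily the exact steps used; note that Dask's reset_index restarts at 0 within every partition, so a global row number has to be built explicitly, and the extra pass plus set_index is what makes this slow):
df['row'] = 1
df['row'] = df['row'].cumsum() - 1       # global 0-based row number
df = df.set_index('row', sorted=True)    # sorted=True avoids a shuffle but still scans the data
df.known_divisions                       # now True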
The end goal is to find rows from ndf where any of the corresponding rows from df matches some criteria.
Your basic algorithm is: "I'd like the first 10 values of df['Node'] to be set to the first value of ndf, the next 10 values to the next value of ndf, and so on." The reason this is hard in Dask is that it doesn't know how many rows are in each partition: you are reading from CSV, and the number of rows you get from X bytes depends on exactly what the data are like in each part. Other formats give you more information...
You will, therefore, certainly need two passes through the data. You could work with the index to figure out divisions and potentially do some sorting. To my mind, the easiest thing you can do is simply to measure the partition lengths, and so get the global offset of the start of each:
lengths = df.map_partitions(len).compute()
offsets = np.concatenate(([0], np.cumsum(lengths.values)[:-1]))  # global start row of each partition
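For illustration, with made-up partition lengths the bookkeeping works out as follows (the numbers are hypothetical; real lengths depend on where the CSV block boundaries fall):
lengths_demo = np.array([9000, 8000, 8000])
offsets_demo = np.concatenate(([0], np.cumsum(lengths_demo)[:-1]))
print(offsets_demo)   # [    0  9000 17000] -- the global row number at which each partition starts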
and now use a custom delayed function to work on the parts:
import dask

@dask.delayed
def add_node(part, offset, ndf):
    # global row numbers for this partition, floor-divided by the repeat factor (10)
    index = pd.Series(np.arange(offset, offset + len(part)) // 10,
                      index=part.index)
    # look up the corresponding value in ndf's first column
    part['Node'] = index.map(ndf[0])
    return part

df2 = dd.from_delayed([add_node(d, off, ndf)
                       for d, off in zip(df.to_delayed(), offsets)])
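df2 is still lazy; nothing is read until you compute. With the Node column attached, the end goal from the question (find the rows of ndf whose corresponding rows in df match some criterion) can be expressed directly. A sketch with a hypothetical criterion on the first data column, which the CSV round-trip in the question names '0':
# Hypothetical criterion: keep Node values whose df rows have column '0' above 0.9
matching_nodes = df2[df2['0'] > 0.9]['Node'].unique().compute()

# Rows of ndf whose value appeared next to at least one matching row of df
rows_of_interest = ndf[ndf[0].isin(matching_nodes)]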