I'm trying to write code that will read from a set of CSVs named my_file_*.csv
into a Dask dataframe.
Then I want to set the partitions based on the length of each CSV. I want to map a function over each partition, and for that to work, each partition must contain one whole CSV.
I've tried resetting the index and then setting partitions based on the length of each CSV, but it looks like the index of the Dask dataframe is not unique.
Is there a better way to partition based on the length of each CSV?
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues.
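If dtype inference is the issue, one option is to pass explicit dtypes through to pandas. A minimal sketch, assuming made-up column names and dtypes:
import dask.dataframe as dd
# "id" and "value" are hypothetical columns; list the ones whose dtype is mis-inferred.
ddf = dd.read_csv("my_file_*.csv", blocksize=None, dtype={"id": "int64", "value": "float64"})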
The fact that dask.dataframe can read the files individually, but not together, suggests that individual files have different numbers of columns. My suggestion would be to print/save the column information for each file loaded individually (with pandas or dask.dataframe); that should point to the problematic file.
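One way to do that check, as a rough sketch using pandas and the standard library (the glob pattern is the one from the question):
import glob
import pandas as pd
# Read only the header row of each file and compare the column sets.
for path in sorted(glob.glob("my_file_*.csv")):
    cols = pd.read_csv(path, nrows=0).columns.tolist()
    print(path, len(cols), cols)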
dd.read_csv can also read CSV files from external resources (e.g. S3, HDFS) by providing a URL instead of a local path. Internally, dd.read_csv uses pandas.read_csv() and supports many of the same keyword arguments; see the docstring for pandas.read_csv() for more information on the available keyword arguments.
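For example, the same glob read from S3 might look like this (the bucket and prefix are hypothetical, and it assumes s3fs is installed):
import dask.dataframe as dd
# "my-bucket/data" is a placeholder; any fsspec-style URL (s3://, hdfs://, ...) works the same way.
ddf = dd.read_csv("s3://my-bucket/data/my_file_*.csv", blocksize=None)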
So one partition should contain exactly one file? You could do:
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
Setting blocksize to None makes sure that files are not split up into several partitions. Therefore, ddf will be a Dask dataframe containing one file per partition.
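With that layout, applying your function to each whole CSV becomes a map_partitions call. A minimal sketch (tag_with_length is a made-up placeholder for your function; inside it, df holds the full contents of one CSV):
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
def tag_with_length(df):
    # df is one whole CSV here, so len(df) is that file's row count.
    return df.assign(file_rows=len(df))
result = ddf.map_partitions(tag_with_length).compute()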
You might want to check out the documentation for read_csv.