I'm trying to write code that will read from a set of CSVs named my_file_*.csv
into a Dask dataframe.
Then I want to set the partitions based on the length of each CSV. I want to map a function over each partition, and for that to work, each partition must contain one whole CSV.
I've tried resetting the index and then setting partitions based on the length of each CSV, but it looks like the index of the Dask dataframe is not unique.
Is there a better way to partition based on the length of each CSV?
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues.
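If dtype inference is the issue, one option is to pass explicit dtypes through to pandas. A minimal sketch, assuming made-up column names and dtypes:
import dask.dataframe as dd
# "id" and "value" are hypothetical columns; list the ones whose dtype is mis-inferred.
ddf = dd.read_csv("my_file_*.csv", blocksize=None, dtype={"id": "int64", "value": "float64"})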
The fact that dask.dataframe can read the files individually, but not together, suggests that individual files have different numbers of columns. My suggestion would be to print/save the column information for each file loaded individually (with pandas or dask.dataframe); that should point to the problematic file.
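One way to do that check, as a rough sketch using pandas and the standard library (the glob pattern is the one from the question):
import glob
import pandas as pd
# Read only the header row of each file and compare the column sets.
for path in sorted(glob.glob("my_file_*.csv")):
    cols = pd.read_csv(path, nrows=0).columns.tolist()
    print(path, len(cols), cols)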
dd.read_csv can also read CSV files from external resources (e.g. S3, HDFS) by providing a URL instead of a local path. Internally, dd.read_csv uses pandas.read_csv() and supports many of the same keyword arguments; see the docstring for pandas.read_csv() for more information on the available keyword arguments.
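For example, the same glob read from S3 might look like this (the bucket and prefix are hypothetical, and it assumes s3fs is installed):
import dask.dataframe as dd
# "my-bucket/data" is a placeholder; any fsspec-style URL (s3://, hdfs://, ...) works the same way.
ddf = dd.read_csv("s3://my-bucket/data/my_file_*.csv", blocksize=None)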
So one partition should contain exactly one file? You could do:
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
Setting blocksize to None makes sure that files are not split up into several partitions. Therefore, ddf will be a Dask dataframe containing one file per partition.
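With that layout, applying your function to each whole CSV becomes a map_partitions call. A minimal sketch (tag_with_length is a made-up placeholder for your function; inside it, df holds the full contents of one CSV):
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
def tag_with_length(df):
    # df is one whole CSV here, so len(df) is that file's row count.
    return df.assign(file_rows=len(df))
result = ddf.map_partitions(tag_with_length).compute()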
You might want to check out the documentation for read_csv.