Dask reading CSV, setting partition as CSV length

I'm trying to write code that reads a set of CSVs named my_file_*.csv into a Dask dataframe.

I then want to set the partitions based on the length of each CSV. I want to map a function over each partition, and for that to work, each partition must be exactly one whole CSV.

I've tried resetting the index and then setting partitions based on the length of each CSV, but it looks like the resulting index of the Dask dataframe is not unique.
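Roughly what I tried looks like this (a minimal sketch of the attempt):

import dask.dataframe as dd

ddf = dd.read_csv("my_file_*.csv")
ddf = ddf.reset_index(drop=True)

# Each partition's index restarts at 0, so the combined index is
# not unique and can't be used as divisions for repartitioning.
print(ddf.index.compute().is_unique)  # False with more than one partition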

Is there a better way to partition based on the length of each CSV?

asked Mar 31 '17 by abcdefg



1 Answer

So one partition should contain exactly one file? You could do:

import dask.dataframe as dd

# blocksize=None keeps each file whole, so one CSV maps to one partition
ddf = dd.read_csv("my_file_*.csv", blocksize=None)

Setting blocksize to None ensures that files are not split into several partitions. As a result, ddf will be a Dask dataframe containing exactly one file per partition.
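From there you can map your function over each partition, and each partition will be one whole CSV. A minimal sketch (process_file here is a hypothetical stand-in for your own function):

import dask.dataframe as dd

def process_file(df):
    # df is a pandas DataFrame holding one entire CSV
    return df.assign(n_rows=len(df))

ddf = dd.read_csv("my_file_*.csv", blocksize=None)
result = ddf.map_partitions(process_file).compute()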

You might want to check out the documentation:

  • general instructions on creating Dask dataframes from data
  • details about read_csv
answered Sep 30 '22 by Arco Bast