How to read a compressed (gz) CSV file into a dask Dataframe?

Is there a way to read a .csv file that is compressed via gz into a dask dataframe?

I've tried it directly with

import dask.dataframe as dd
df = dd.read_csv("Data.gz")

but get a Unicode error (probably because dask is interpreting the compressed bytes as text). There is a "compression" parameter, but compression = "gz" won't work, and I can't find any documentation on it so far.

With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-) but if I restrict the number of rows it works fine.

import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)
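As a side note on the memory issue: pandas can also stream a gzipped CSV in bounded chunks via `chunksize`, which avoids loading the whole file at once. A minimal self-contained sketch (it writes a small throwaway `Data.gz` first so the example runs on its own; the file name and column layout are placeholders, not the asker's real data):

```python
import gzip
import pandas as pd

# Create a small gzipped CSV purely for demonstration
# ("Data.gz" stands in for the real file).
with gzip.open("Data.gz", "wt") as f:
    f.write("a,b\n")
    for i in range(1000):
        f.write(f"{i},{i * 2}\n")

# pandas infers gzip compression from the .gz extension;
# chunksize streams the file so memory stays bounded.
total = 0
for chunk in pd.read_csv("Data.gz", chunksize=200):
    total += len(chunk)

print(total)  # total rows read across all chunks
```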
asked Oct 07 '16 by Magellan88



1 Answer

Pandas' current documentation says:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

Since 'infer' is the default, that would explain why it works with pandas.

Dask's documentation on the compression argument:

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

That would suggest that it should also infer the compression, at least for gz. That it doesn't (and still doesn't as of 0.15.3) may be a bug. However, it works using compression='gzip'.

i.e.:

import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')
answered Sep 17 '22 by de1