How to read a compressed (gz) CSV file into a dask Dataframe?

Is there a way to read a gzip-compressed .csv file into a dask dataframe?

I've tried it directly with

import dask.dataframe as dd
df = dd.read_csv("Data.gz" )

but I get a Unicode error (probably because it is interpreting the compressed bytes as text). There is a "compression" parameter, but compression = "gz" doesn't work, and I can't find any documentation on it so far.
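
That guess about the compressed bytes can be checked with the standard library alone. A minimal sketch, using a made-up two-line CSV as a stand-in for the real Data.gz:

```python
import gzip

# Hypothetical stand-in for the contents of Data.gz.
raw = gzip.compress(b"a,b\n1,2\n")

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
print(raw[:2])

# Decoding the compressed bytes as text fails, because 0x8b is not a
# valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("not valid UTF-8:", exc.reason)

# Decompressing first gives back the original CSV text.
print(gzip.decompress(raw).decode("utf-8"))
```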

With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-) If I restrict the number of lines, it works fine.

import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)

asked Oct 07 '16 19:10

Magellan88

1 Answer

The current pandas documentation says:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

Since 'infer' is the default, pandas detects gzip from the .gz extension, which would explain why it works there.

Dask's documentation on the compression argument:

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

That suggests dask should also infer the compression, at least for gz. The fact that it doesn't (still the case in 0.15.3) may be a bug. However, passing compression='gzip' explicitly works.

i.e.:

import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')

answered Sep 17 '22 15:09

de1

Tags: python, pandas, csv, dask
