
Reading csv with separator in python dask

I am trying to create a DataFrame by reading a CSV file whose fields are separated by '#####' (five hashes).

The code is:

import dask.dataframe as dd
df = dd.read_csv('D:\temp.csv',sep='#####',engine='python')
res = df.compute()

Error is:

Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns.  These first 1,000 rows led us to an incorrect
guess.

For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

    df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:

  "The 'dtype' option is not supported with the 'python' engine"

File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)

How can I get rid of this error?

If I follow the error message, I would have to specify a dtype for every column, which is impractical when there are 100+ columns.

If I read the file without the separator, everything works, but the ##### is left everywhere in the data. After computing it to a pandas DataFrame, is there a way to get rid of that?

Please help me with this.

Satya asked Feb 09 '23 07:02


2 Answers

Read the entire file with dtype=object, so that every column is interpreted as type object. This should read the file correctly and get rid of the ##### in each row. From there, you can turn it into a pandas DataFrame using the compute() method. Once the data is in a pandas DataFrame, you can use the pandas infer_objects method to update the types without having to hard-code them.

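import dask.dataframe as dd

# Raw string so that '\t' in the Windows path is not read as a tab character.
# dtype='object' skips type guessing, so every column is read as type object.
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()
res = df.infer_objects()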
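
One caveat that may be worth checking on your own data: as far as I know, infer_objects only does a soft conversion of Python objects, so values that were read in as strings with dtype=object can stay as strings. Below is a minimal sketch of an explicit conversion with pd.to_numeric. The sample text, the column names and the helper convert_numeric_columns are made up for illustration, and plain pandas is used in place of dask so that the snippet runs without a file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file; the columns are made up.
RAW = "id#####price#####label\n1#####2.5#####a\n2#####3.0#####b\n"


def convert_numeric_columns(frame):
    """Return a copy in which every fully numeric column gets a numeric dtype."""
    out = frame.copy()
    for name in out.columns:
        try:
            out[name] = pd.to_numeric(out[name])
        except (ValueError, TypeError):
            # Not every value parses as a number; keep the column unchanged.
            pass
    return out


# dtype=object reads every field as a string, as in the answer above.
strings = pd.read_csv(io.StringIO(RAW), sep="#####", engine="python", dtype=object)
typed = convert_numeric_columns(strings)
print(typed.dtypes)
```

Columns that contain any non-numeric value are left untouched, so there is still no need to list dtypes by hand for 100+ columns.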
Benjamin Cohen answered Feb 24 '23 14:02

If you want to keep the entire file as a dask dataframe, I had some success with a dataset with a large number of columns simply by increasing the number of bytes sampled in read_csv.

For example:

import dask.dataframe as dd

# Raw string so that '\t' in the Windows path is not read as a tab character.
df = dd.read_csv(r'D:\temp.csv', sep='#####', sample=1000000)  # sample 1e6 bytes

This can resolve some type inference issues, although unlike Benjamin Cohen's answer, you would need to find the right value for sample.
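
To see why a larger sample can help, here is a rough illustration in plain pandas (not dask's actual sampling code): dtypes guessed from only the first rows can differ from the dtypes of the whole file. The data and column names are invented for the example, and nrows stands in only loosely for dask's byte-based sample.

```python
import io

import pandas as pd

# Hypothetical data: 'value' holds integers until a float shows up in the last row.
lines = ["id#####value"] + [f"{i}#####{i}" for i in range(1, 6)] + ["6#####2.5"]
RAW = "\n".join(lines) + "\n"

# Guess dtypes from the first five rows only, a stand-in for a small sample.
head = pd.read_csv(io.StringIO(RAW), sep="#####", engine="python", nrows=5)

# Guess dtypes from every row, a stand-in for a sample that covers the odd row.
full = pd.read_csv(io.StringIO(RAW), sep="#####", engine="python")

print(head["value"].dtype)
print(full["value"].dtype)
```

If a mismatch remains even with a larger sample, the error message quoted in the question suggests passing dtype for just the affected columns.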

Will C answered Feb 24 '23 13:02