
Dask read_csv: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`

I'm trying to use dask to read a CSV file, and it gives me the error below. However, I want ARTICLE_ID to be read as object (string). Can anyone help me read the data successfully?

The traceback is below:

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+------------+--------+----------+
| Column     | Found  | Expected |
+------------+--------+----------+
| ARTICLE_ID | object | int64    |
+------------+--------+----------+

The following columns also raised exceptions on conversion:

ARTICLE_ID:


ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\\\'",)

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'ARTICLE_ID': 'object'}

to the call to `read_csv`/`read_table`.
Coffey Liu asked Sep 24 '18 20:09



2 Answers

The message is suggesting that you change your call from

df = dd.read_csv('mylocation.csv', ...)

to

df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})

where you should change the file location and any other arguments to what you were using before. If this still doesn't work, then please update your question.

mdurant answered Nov 07 '22 18:11


You can use the sample parameter of the read_csv method and assign it an integer giving the number of bytes to use when determining dtypes. For example, I had to set it to 25000000 to correctly infer the types of my data, which has shape (171907, 161).

df = dd.read_csv("game_logs.csv", sample=25000000)

https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
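To see why a larger sample can help, here is a simplified pure-Python illustration of sample-based inference. It is not dask's actual implementation; the helper `infer_column_kind` and the data are hypothetical. It guesses a column type from only the first `sample_bytes` bytes, so a non-integer value that appears late in the file is missed when the sample is small.

```python
import csv
import io


def infer_column_kind(csv_text, column, sample_bytes):
    """Guess 'int64' or 'object' for `column` from the first `sample_bytes` bytes."""
    raw = csv_text.encode("utf-8")
    if sample_bytes >= len(raw):
        sample = csv_text
    else:
        sample = raw[:sample_bytes].decode("utf-8", errors="ignore")
        # Drop the last, possibly truncated, line so only complete rows are inspected.
        sample = sample[: sample.rfind("\n") + 1]
    for row in csv.DictReader(io.StringIO(sample)):
        try:
            int(row[column])
        except (TypeError, ValueError):
            return "object"
    return "int64"


# Hypothetical data: 1000 integer IDs followed by one non-integer ID at the end.
rows = ["ARTICLE_ID,TITLE"]
rows += [f"{i},title {i}" for i in range(1000)]
rows.append("PMC123,late row")
csv_text = "\n".join(rows) + "\n"

small = infer_column_kind(csv_text, "ARTICLE_ID", 200)
large = infer_column_kind(csv_text, "ARTICLE_ID", len(csv_text.encode("utf-8")))
print(small, large)
```

The small sample sees only integer IDs and guesses int64, while the full-size sample reaches the late row and guesses object. Raising `sample` in `dd.read_csv` widens that window in the same spirit, at the cost of reading more bytes up front.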

gench answered Nov 07 '22 16:11