I'm trying to use Dask to read a CSV file, and it gives me the error below. The thing is, I want my ARTICLE_ID column to be object (string). Can anyone help me read the data successfully?
The traceback is:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+------------+--------+----------+
| Column | Found | Expected |
+------------+--------+----------+
| ARTICLE_ID | object | int64 |
+------------+--------+----------+
The following columns also raised exceptions on conversion:
ARTICLE_ID:
ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\\\'",)
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'ARTICLE_ID': 'object'}
to the call to `read_csv`/`read_table`.
The message is suggesting that you change your call from
df = dd.read_csv('mylocation.csv', ...)
to
df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})
where you should change the file location and any other arguments to what you were using before. If this still doesn't work, then please update your question.
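For example, a minimal sketch (assuming the file really is called mylocation.csv and no other arguments are needed) would be:
import dask.dataframe as dd

# Pin ARTICLE_ID to a string/object dtype in every partition, so Dask's
# sampled dtype inference can no longer disagree with later chunks.
df = dd.read_csv('mylocation.csv', dtype={'ARTICLE_ID': 'object'})
print(df.dtypes)  # ARTICLE_ID should now be reported as object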
You can use the sample parameter of the read_csv method and assign it an integer indicating the number of bytes to use when determining dtypes. For example, I had to set it to 25000000 to correctly infer the types of my data, which has shape (171907, 161).
df = dd.read_csv("game_logs.csv", sample=25000000)
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
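As a rough sketch of the same idea (the 25000000 figure is only what worked for that particular file), you can sample more bytes than the default when Dask guesses dtypes and then inspect what was inferred:
import dask.dataframe as dd

# Sample more bytes per file when guessing dtypes (the default sample is
# much smaller), then check the inferred dtype of each column.
df = dd.read_csv("game_logs.csv", sample=25000000)
print(df.dtypes)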