I am working on a dataset with 5.5 million rows in a Kaggle competition. Reading the .csv and processing it takes hours in Pandas.
That is where Dask comes in. Dask is fast, but I keep running into errors.
This is a snippet of the code:
# drop some columns
df = df.drop(['dropoff_latitude', 'dropoff_longitude',
              'pickup_latitude', 'pickup_longitude', 'pickup_datetime'], axis=1)

# one-hot-encode categorical columns
df = dd.get_dummies(df.categorize())

# split train and test and export as csv
test_df = df[df['fare_amount'] == -9999]
train_df = df[df['fare_amount'] != -9999]
test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')
When the lines

test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')

are run, it produces the error
ValueError: The columns in the computed data do not match the columns
in the provided metadata
What could cause this, and how can I fix it?
N.B. This is my first time using Dask.
The read_csv docstring describes how this situation can arise when reading from CSV. Likely, if you had done len(dd.read_csv(...)), you would have seen it already, without the drop, dummies and train/test split. The error message probably tells you exactly which column(s) are the problem, and what type was expected versus what was found.
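For illustration, something like the following would surface the same mismatch before any further processing; this is a minimal sketch, and the file name is a placeholder for your competition CSV:

import dask.dataframe as dd

# placeholder file name; substitute the actual competition CSV
df = dd.read_csv('train.csv')

# any operation that forces a full pass over the data, even just counting
# rows, is enough to trigger the metadata check and raise the same ValueError
print(len(df))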
What happens is that Dask guesses the dtypes of the dataframe from the first block of the first file. Sometimes this does not reflect the type throughout the whole dataset: for example, if a column happens to have no values in the first block, its type will be inferred as float64, because pandas uses nan as a NULL placeholder. In such cases, you want to determine the correct dtypes and supply them to read_csv using the dtype= keyword. See the pandas documentation for the typical use of dtype= and the other arguments for data parsing/conversion that might help at load time.
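As a sketch of that fix: the column names and dtypes below are assumptions for illustration only; the error message, plus a quick pandas inspection of a sample of the file, will tell you the real ones.

import pandas as pd
import dask.dataframe as dd

# inspect a generous sample with pandas to see what the dtypes should be
sample = pd.read_csv('train.csv', nrows=100_000)
print(sample.dtypes)

# then pass the corrected dtypes explicitly when loading with Dask;
# these column names/types are examples, not your actual schema
df = dd.read_csv(
    'train.csv',
    dtype={
        'fare_amount': 'float64',
        'passenger_count': 'float64',  # float so that missing values (NaN) fit
    },
)

If the problem columns are integers that occasionally contain missing values, dd.read_csv's assume_missing=True option has a similar effect by treating inferred integer columns as floats.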