I am working on a dataset with 5.5 million rows in a Kaggle competition. Reading the .csv and processing it takes hours in Pandas.
That is where Dask comes in. Dask is fast, but I keep running into errors.
This is a snippet of the code:
# drop some columns
df = df.drop(['dropoff_latitude', 'dropoff_longitude',
              'pickup_latitude', 'pickup_longitude', 'pickup_datetime'], axis=1)

# one-hot-encode categorical columns
df = dd.get_dummies(df.categorize())

# split train and test and export as csv
test_df = df[df['fare_amount'] == -9999]
train_df = df[df['fare_amount'] != -9999]
test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')
When the lines

test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')

are run, it produces the error
ValueError: The columns in the computed data do not match the columns
in the provided metadata
What could cause this, and how can I fix it?
N.B. This is my first time using Dask.
The read_csv docstring describes how this situation can arise when reading from CSV. Likely, if you had done len(dd.read_csv(...)), you would have seen it already, without the drop, dummies and train/test split. The error message probably tells you exactly which column(s) are the problem, and what type was expected versus what was found.
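For illustration, something like the following would surface the same mismatch before any further processing; this is a minimal sketch, and the file name is a placeholder for your competition CSV:

import dask.dataframe as dd

# placeholder file name; substitute the actual competition CSV
df = dd.read_csv('train.csv')

# any operation that forces a full pass over the data, even just counting
# rows, is enough to trigger the metadata check and raise the same ValueError
print(len(df))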
What happens is that Dask guesses the dtypes of the dataframe from the first block of the first file. Sometimes this does not reflect the type throughout the whole dataset: for example, if a column happens to have no values in the first block, its type will be inferred as float64, because pandas uses nan as a NULL placeholder. In such cases, you want to determine the correct dtypes and supply them to read_csv using the dtype= keyword. See the pandas documentation for the typical use of dtype= and the other arguments for data parsing/conversion that might help at load time.
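As a sketch of that fix: the column names and dtypes below are assumptions for illustration only; the error message, plus a quick pandas inspection of a sample of the file, will tell you the real ones.

import pandas as pd
import dask.dataframe as dd

# inspect a generous sample with pandas to see what the dtypes should be
sample = pd.read_csv('train.csv', nrows=100_000)
print(sample.dtypes)

# then pass the corrected dtypes explicitly when loading with Dask;
# these column names/types are examples, not your actual schema
df = dd.read_csv(
    'train.csv',
    dtype={
        'fare_amount': 'float64',
        'passenger_count': 'float64',  # float so that missing values (NaN) fit
    },
)

If the problem columns are integers that occasionally contain missing values, dd.read_csv's assume_missing=True option has a similar effect by treating inferred integer columns as floats.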