How to prevent Azure ML Studio from converting a feature column to DateTime while importing a dataset

I’m having trouble loading a dataset in Azure ML Studio. The dataset contains a column that looks like a DateTime but is in fact a string. Azure ML Studio converts the values to DateTimes internally, and no amount of wrangling seems to convince it that they’re in fact strings.

This is an issue, because during conversion the values lose precision and start appearing as duplicates whereas in fact they are unique. Does anybody know if ML Studio can be configured not to infer data types for columns while importing a dataset?

Now, for the long(er) story :)

I’m working here with a public dataset - specifically Kaggle’s New York City Taxi Fare Prediction competition. I wanted to see if I could do a quick-and-dirty solution using Azure ML Studio; however, the dataset’s unique key values are of the form 2015-01-27 13:08:24.0000003, 2015-01-27 13:08:24.0000002, 2011-10-06 12:10:20.0000001, and so on.

When importing them in my experiment the key values get converted to DateTime, making them no longer unique, even though they’re unique in the csv. Needless to say, this prevents me from submitting any solution to Kaggle, since I can’t identify the rows uniquely :).
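To see how a lossy conversion can turn unique keys into duplicates, here is a minimal Python sketch. It assumes the conversion keeps at most microsecond precision (six fractional digits); the precision ML Studio actually keeps is not stated here, and the helper name `to_datetime_lossy` is hypothetical.

```python
from datetime import datetime

# Example keys quoted from the dataset: seven fractional digits each.
keys = [
    "2015-01-27 13:08:24.0000003",
    "2015-01-27 13:08:24.0000002",
    "2011-10-06 12:10:20.0000001",
]


def to_datetime_lossy(key):
    """Parse a key as a datetime, dropping the 7th fractional digit.

    Python's datetime stores microseconds only, so the last digit is cut
    off before parsing. This stands in for whatever truncation the
    importer performs (an assumption, not documented behavior).
    """
    return datetime.strptime(key[:26], "%Y-%m-%d %H:%M:%S.%f")


parsed = [to_datetime_lossy(k) for k in keys]

unique_as_strings = len(set(keys))
unique_as_datetimes = len(set(parsed))
print(unique_as_strings, unique_as_datetimes)
```

The first two keys differ only in the seventh fractional digit, so they collapse into the same value once that digit is gone.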

I’ve tried the following:

  • editing the metadata of the dataset after it has been loaded and setting the data type of the column to string, but this doesn’t help because the precision has already been lost
  • importing the dataset from an Azure blob, converting it to csv and then loading it in Jupyter/Python - this gives me the same (duplicated) keys
  • loading the dataset locally with pandas, which works as expected
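For comparison, the local pandas load that keeps the keys intact might look like the sketch below. The column names `key` and `fare_amount` and the fare values are assumptions used for illustration, and an in-memory string stands in for the real csv file.

```python
import io

import pandas as pd

# Stand-in for the csv file; the fare values are placeholders.
csv_text = (
    "key,fare_amount\n"
    "2015-01-27 13:08:24.0000003,11.35\n"
    "2015-01-27 13:08:24.0000002,11.35\n"
    "2011-10-06 12:10:20.0000001,11.35\n"
)

# dtype={"key": str} makes it explicit that the key column must stay text,
# so no date parsing is applied to it.
df = pd.read_csv(io.StringIO(csv_text), dtype={"key": str})

keys_loaded = df["key"].tolist()
print(df["key"].is_unique)
```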

I’ve reproduced this behavior with both the big, 5.5GB train dataset and the more manageable sample_submission dataset.

Curious to know if there is some sort of workaround to tell ML Studio not to try converting this column while loading the dataset. I'm looking here specifically for Azure ML Studio-only solutions, as I don't want to do any preprocessing on the dataset.

Vlad Iliescu asked Aug 10 '18 06:08




1 Answer

I tried this with your sample data, and here is my quick-and-dirty solution:

  1) Add any symbol (I added '#') in front of each date.
  2) Load the file into AML Studio (the column is now treated as a string feature).
  3) Add a Python/R component to remove the '#' symbol and explicitly convert the column to string (as.character(columnname) in R or str(columnname) in Python).
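The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested ML Studio pipeline: the column name `key`, the helper `prefix_key_column`, and the placeholder fare values are hypothetical, and `azureml_main` follows the entry-point convention of the Execute Python Script module in ML Studio (classic).

```python
import csv
import io

import pandas as pd

MARKER = "#"  # any character that stops the value from looking like a date


def prefix_key_column(src, dst, column="key", marker=MARKER):
    """Step 1: copy a csv, prepending `marker` to every value in `column`."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = marker + row[column]
        writer.writerow(row)


def azureml_main(dataframe1=None, dataframe2=None):
    """Step 3: strip the marker again and keep the column as text."""
    dataframe1["key"] = dataframe1["key"].astype(str).str.lstrip(MARKER)
    return dataframe1,


# Local dry run; step 2 (uploading the marked file to the Studio) is replaced
# here by reading the marked csv back with pandas.
raw = io.StringIO(
    "key,fare_amount\n"
    "2015-01-27 13:08:24.0000003,11.35\n"
    "2015-01-27 13:08:24.0000002,11.35\n"
)
marked = io.StringIO()
prefix_key_column(raw, marked)
marked.seek(0)
(restored,) = azureml_main(pd.read_csv(marked))
restored_keys = restored["key"].tolist()
print(restored_keys)
```

Only `azureml_main` would run inside the Studio; `prefix_key_column` has to run on the file before it is uploaded.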

Hope this helps

Alibek Jakupov answered Sep 28 '22 05:09
