
How to handle date variable in machine learning data pre-processing

I have a dataset that contains, among other variables, the timestamp of each transaction in the format 26-09-2017 15:29:32. I need to find possible correlations with sales and predict them (let's say with logistic regression). My questions are:

  1. How should I handle the date format? Should I convert it to a single number (as Excel does automatically)? Should I split it into more variables such as day, month, year, hour, minute, second? Any other suggestions?
  2. What if I would like to add a distinct week number per year? Should I add a variable like 342017 (week 34 of year 2017)?
  3. Should I do the same as in question 2 for the quarter of the year?
#         Datetime               Gender        Purchase
1    23/09/2015 00:00:00           0             1
2    23/09/2015 01:00:00           1             0
3    25/09/2015 02:00:00           1             0
4    27/09/2015 03:00:00           1             1
5    28/09/2015 04:00:00           0             0
yppdgr asked Sep 26 '17 14:09




2 Answers

Cyclic Feature Encoding

Data whose values come from a fixed set that repeats in a cycle is known as cyclic data. Time-related features are mostly cyclic: months of a year, days of a week, hours of a day, minutes of an hour, and so on. Every observation takes a value from that fixed set. Such features come up in many ML problems, and handling them properly has been shown to help improve accuracy.

Implementation

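import numpy as np

def encode(data, col, max_val):
    # Map a cyclic column onto the unit circle; max_val is the cycle length.
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data

# data is a pandas DataFrame whose 'datetime' column has a datetime dtype
data['month'] = data.datetime.dt.month
data = encode(data, 'month', 12)

# Day of year (1-365, or 366 in leap years), not the month again
data['day'] = data.datetime.dt.dayofyear
data = encode(data, 'day', 365)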
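The snippet above assumes `data.datetime` already holds parsed datetimes. As a minimal sketch of the step before it, the sample rows from the question can be parsed with an explicit day-first format and then encoded. The column names (`datetime`, `gender`, `purchase`) and the choice to encode the hour are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; the column names are assumptions.
df = pd.DataFrame({
    "datetime": [
        "23/09/2015 00:00:00",
        "23/09/2015 01:00:00",
        "25/09/2015 02:00:00",
        "27/09/2015 03:00:00",
        "28/09/2015 04:00:00",
    ],
    "gender": [0, 1, 1, 1, 0],
    "purchase": [1, 0, 0, 1, 0],
})

# Spell out the day-first format so that a string such as 03/04/2015
# is never read as March 4th.
df["datetime"] = pd.to_datetime(df["datetime"], format="%d/%m/%Y %H:%M:%S")

def encode(data, col, max_val):
    # Same sine/cosine encoding as in the answer.
    data[col + "_sin"] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + "_cos"] = np.cos(2 * np.pi * data[col] / max_val)
    return data

# The hour of day repeats every 24 hours.
df["hour"] = df["datetime"].dt.hour
df = encode(df, "hour", 24)
```

If the file uses the 26-09-2017 15:29:32 layout mentioned in the question instead, the format string would be `"%d-%m-%Y %H:%M:%S"`.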

The Logic

A common method for encoding cyclical data is to transform it into two dimensions using a sine and cosine transformation. Each cyclical variable is mapped onto a circle so that its lowest value sits right next to its largest value, which lets a model see December and January as neighbours rather than as values 11 units apart. The x- and y-components of each point are computed with the sin and cos trigonometric functions.

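$x_{sin} = \sin\left(\frac{2 \pi x}{P}\right)$

$x_{cos} = \cos\left(\frac{2 \pi x}{P}\right)$

Here $P$ is the length of the cycle (the `max_val` argument in the code): 12 for months, 24 for hours. For values numbered from 0 this is $\max(x) + 1$; dividing by $\max(x)$ itself would map the lowest and largest values onto the same point.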
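A quick numeric check of the idea, as a sketch: with months numbered 0-11 and a cycle length of 12, December and January land close together on the circle even though their raw values are 11 apart. The helper name `to_circle` is made up for this illustration.

```python
import numpy as np

def to_circle(value, period):
    # Return the (sin, cos) point for one value of a cyclic variable.
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Months numbered 0-11, cycle length 12.
jan = to_circle(0, 12)
jun = to_circle(5, 12)
dec = to_circle(11, 12)

# Euclidean distances in the encoded space.
d_dec_jan = np.linalg.norm(dec - jan)  # about 0.52
d_jun_jan = np.linalg.norm(jun - jan)  # about 1.93

# Raw distances point the other way: |11 - 0| = 11 but |5 - 0| = 5.
print(d_dec_jan, d_jun_jan)
```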

For months, we number them 0-11 and place them evenly around a circle, so that month 11 sits next to month 0.

[Figure: months arranged around a circle]

More on Feature Engineering Cyclic Features

Pluviophile answered Oct 19 '22 21:10


Some random thoughts:

Dates are a good source for feature engineering, and I don't think there is a single right way to use them in a model. Business user expertise would be a great help here: are there observed trends that can be coded into the data?

Possible suggestions of features include:

  • weekends vs weekdays
  • business hours and time of day
  • seasons
  • week of year number
  • month
  • year
  • beginning/end of month (pay days)
  • quarter
  • days to/from an action event (distance)
  • missing or incomplete data
  • etc.

All this depends on the data set and most won't apply.
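Several of the features listed above can be derived with pandas' `.dt` accessor. The following is only a sketch: the sample timestamps, the 09:00-17:00 business-hours window, and the event date are assumptions chosen for illustration, not values from the question's data.

```python
import pandas as pd

# Timestamps in the question's dd-mm-YYYY HH:MM:SS layout (sample values).
ts = pd.to_datetime(
    pd.Series(["26-09-2017 15:29:32", "31-12-2017 09:00:00", "01-01-2017 23:10:00"]),
    format="%d-%m-%Y %H:%M:%S",
)

feats = pd.DataFrame({"ts": ts})

# Weekends vs weekdays: dayofweek is 0 for Monday and 6 for Sunday.
feats["is_weekend"] = ts.dt.dayofweek >= 5

# Business hours: 09:00-17:00 on weekdays (the window is an assumption).
feats["business_hours"] = (ts.dt.hour >= 9) & (ts.dt.hour < 17) & ~feats["is_weekend"]

feats["month"] = ts.dt.month
feats["quarter"] = ts.dt.quarter
feats["year"] = ts.dt.year

# ISO week number together with the ISO year it belongs to.
iso = ts.dt.isocalendar()
feats["iso_year"] = iso["year"]
feats["iso_week"] = iso["week"]

# Beginning/end of month flags (e.g. for pay-day effects).
feats["is_month_start"] = ts.dt.is_month_start
feats["is_month_end"] = ts.dt.is_month_end

# Days to a hypothetical event date; negative once the event has passed.
event = pd.Timestamp("2017-12-25")
feats["days_to_event"] = (event - ts.dt.normalize()).dt.days
```

Keeping the year and week as two separate columns, rather than fusing them into one number such as 342017, avoids implying an ordering or distance between those fused codes. The ISO year can differ from the calendar year around New Year: 1 January 2017 falls in the last ISO week of 2016.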

Some links:

http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction

https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models

http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/

Ryan John answered Oct 19 '22 21:10