I have a data-set that contains, among other variables, the time-stamp of the transaction in the format 26-09-2017 15:29:32. I need to find possible correlations with the sales and predict them (let's say with logistic regression). My questions are:
#  Datetime             Gender  Purchase
1  23/09/2015 00:00:00  0       1
2  23/09/2015 01:00:00  1       0
3  25/09/2015 02:00:00  1       0
4  27/09/2015 03:00:00  1       1
5  28/09/2015 04:00:00  0       0
Basically, you can break apart the date and get the year, month, week of year, day of month, hour, minute, second, etc. You can also get the day of the week (Monday = 0, Sunday = 6). Be careful with week of year: under ISO week numbering, the first few days of January may be assigned week 52 or 53 if that week began in the prior year.
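As a minimal sketch of that breakdown, the snippet below uses only Python's standard `datetime` module. The helper name `date_parts`, the dictionary keys, and the `dd/mm/YYYY HH:MM:SS` format string are assumptions chosen to match the sample rows, not something given in the question.

```python
from datetime import datetime

def date_parts(stamp, fmt="%d/%m/%Y %H:%M:%S"):
    """Split a timestamp string into simple calendar features (hypothetical helper)."""
    ts = datetime.strptime(stamp, fmt)
    # isocalendar() returns (ISO year, ISO week, ISO weekday)
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_month": ts.day,
        "day_of_year": ts.timetuple().tm_yday,
        "day_of_week": ts.weekday(),  # Monday = 0, Sunday = 6
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        "iso_week": iso_week,
        "iso_year": iso_year,
    }

parts = date_parts("23/09/2015 01:00:00")
new_year = date_parts("01/01/2016 00:00:00")
```

With ISO numbering, `new_year` reports week 53 of ISO year 2015 even though the calendar year is 2016, which is the week-of-year pitfall mentioned above. Keeping the ISO year next to the ISO week is one way to avoid mixing up the two.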
DateTime can be used to extract new features that are added to the other available features of the dataset. A date is composed of a day, a month, and a year. From these three parts, several different features can be extracted, such as day of year, day of month, or day of the week.
Data whose values come from a fixed set that repeats in a cycle is known as cyclic data. Time-related features are mainly cyclic in nature: for example, months of a year, days of a week, hours, and minutes. Each observation takes a value from this set only. Such features appear in many ML problems, and handling them properly has proved to help improve accuracy.
Implementation
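import numpy as np

def encode(data, col, max_val):
    # Map a cyclical column onto the unit circle as sine and cosine components
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data

# 'data' is a pandas DataFrame whose 'datetime' column has a datetime dtype
data['month'] = data.datetime.dt.month
data = encode(data, 'month', 12)
# Use day of year (not month) so the values match the period of 365
data['day'] = data.datetime.dt.dayofyear
data = encode(data, 'day', 365)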
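The implementation assumes that `data.datetime` already holds parsed timestamps. A possible way to get there from day-first strings like the sample rows is sketched below; the frame name `raw`, the lowercase column name `datetime`, and the format string are assumptions made for illustration.

```python
import pandas as pd

# First three sample rows, in the day-first format used in the question
raw = pd.DataFrame({
    "datetime": ["23/09/2015 00:00:00", "23/09/2015 01:00:00", "25/09/2015 02:00:00"],
    "Gender": [0, 1, 1],
    "Purchase": [1, 0, 0],
})

# An explicit format keeps day and month from being swapped during parsing
raw["datetime"] = pd.to_datetime(raw["datetime"], format="%d/%m/%Y %H:%M:%S")

# The .dt accessor is only available once the column has a datetime dtype
raw["month"] = raw["datetime"].dt.month
raw["day_of_year"] = raw["datetime"].dt.dayofyear
raw["hour"] = raw["datetime"].dt.hour
```

For timestamps written with dashes, such as 26-09-2017 15:29:32, the format string would presumably become `"%d-%m-%Y %H:%M:%S"`.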
The Logic
A common method for encoding cyclical data is to transform the data into two dimensions using a sine and cosine transformation. Map each cyclical variable onto a circle such that the lowest value for that variable appears right next to the largest value. We compute the x- and y- components of that point using sin and cos trigonometric functions.
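$x_{sin} = \sin\left(\frac{2 \pi x}{P}\right)$
$x_{cos} = \cos\left(\frac{2 \pi x}{P}\right)$
Here $P$ is the length of the cycle, i.e. the number of distinct values (12 for months, 24 for hours). It equals $\max(x)$ when the values are numbered from 1; if they are numbered from 0, dividing by $\max(x)$ would map the first and last values onto the same point.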
To handle months, we number them from 0 to 11; refer to the figure below.
We can do that using the following transformations:
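A small worked example of the sine and cosine transformation for months numbered 0-11, using only the standard `math` module. The helper names `cyclical` and `distance` are hypothetical, and the period is taken to be 12, the number of distinct month values.

```python
import math

def cyclical(value, period):
    """Return the (sin, cos) coordinates of `value` on a circle of length `period`."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two (sin, cos) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Months numbered 0-11, so the period is 12
months = [cyclical(m, 12) for m in range(12)]

dec_to_jan = distance(months[11], months[0])  # wraps around the end of the year
jan_to_feb = distance(months[0], months[1])   # ordinary neighbours
jan_to_jul = distance(months[0], months[6])   # opposite sides of the circle
```

In the encoded space December is as close to January as January is to February, while January and July sit at opposite ends of the circle. With the raw values 0-11, December and January would instead be 11 units apart.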
More on Feature Engineering Cyclic Features
Some random thoughts:
Dates are good sources for feature engineering, but I don't think there is one method for using dates in a model. Business user expertise would be great: are there observed trends that can be coded into the data?
Possible suggestions of features include:
All this depends on the data set and most won't apply.
Some links:
http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction
https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models
http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/