I'm using Microsoft Azure Machine Learning Studio to try an experiment where I use previous analytics captured about a user (at a time, on a day) to try and predict their next action (based on day and time) so that I can adjust the UI accordingly. So if a user normally visits a certain page every Thursday at 1pm, then I would like to predict that behaviour.
Warning - I am a complete novice with ML, but have watched quite a few videos and worked through tutorials like the movie recommendations example.
I have a csv dataset with userid,action,datetime and would like to train a matchbox recommendation model, which, from my research appears to be the best model to use. I can't see a way to use date/time in the training. The idea being that if I could pass in a userid and the date, then the recommendation model should be able to give me a probably result of what that user is most likely to do.
I get results from the predictive endpoint, but the training endpoint gives the following error:
{
"error": {
"code": "ModuleExecutionError",
"message": "Module execution encountered an error.",
"details": [
{
"code": "18",
"target": "Train Matchbox Recommender",
"message": "Error 0018: Training dataset of user-item-rating triples contains invalid data."
}
]
}
}
Here is a link to a public version of the experiment
Any help would be appreciated.
Thanks.
Building Machine Learning Models From the results in Figure 20 above, we see that the LogisticRegression model is the best in terms of the metrics accuracy and F₁-score.
So from messing with this for a while, I think I may see where the issue may lie. I think that the first three inputs of the Train Matchbox Recommender would need to be filled in for an accurate prediction. I'll include screenshots of the sample for recommending restaurants, as well.
The first input would be the dataset consisting of the user, item, and rating.
The second input would be the features of each user.
And the third input would be the features of each feature (restaurant in this case).
So to help with the date/time issue, I'm wondering if the data would need to be munged to match something similar to the restaurant and user data.
I know it's not much, but I hope it helps lead you down the right track.
Maybe this answer could be helpful, you may also take a look on this where you can read:
The problem is probably with the range of rating data. There's an upper limit for rating range, because the training gets expensive if the range between smallest and largest rating is too large.
[...]
One option would be to scale the ratings to a narrower range.
According to this MSDN, please note that you cannot have a gap between the min and max note higher than 100
.
So you have to make a pre-processing on your csv file column data (userid, action, datetime etc...) in order to keep all column data in the [0-99]
range.
Please see bellow a Python implementation (to share the logic):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
big_gap_arr = [-250,-2350,850,-120,-1235,3212,1,5,65,48,265,1204,65,23,45,895,5000,3,325,3244,5482] #data with big gap
abs_min = abs(min(big_gap_arr)) #get the absolute minimal value
max_diff= ( max(big_gap_arr) + abs_min ) #get the maximal diff
specific_range_arr=[]
for each_value in big_gap_arr:
new_value = ( 99/1. * float( abs_min + each_value) / max_diff ) #get a corresponding value in the [0,99] range
specific_range_arr.append(new_value)
print specific_range_arr #post computed data => all in range [0,99]
Which give you:
[26.54494382022472, 0.0, 40.449438202247194, 28.18820224719101, 14.094101123595506, 70.3061797752809, 29.71769662921348, 29.76825842696629, 30.526685393258425, 30.31179775280899, 33.05477528089887, 44.924157303370784, 30.526685393258425, 29.995786516853933, 30.27387640449438, 41.01825842696629, 92.90730337078652, 29.742977528089888, 33.813202247191015, 70.71067415730337, 99.0]
Note that all data are now in the [0,99]
range
Following this process:
User id could be float instead an integer
Action is an integer (if you got less than 100 actions) or float (if more than 100 actions)
Datetime will be splited in two integer (or one integer and one float), please see bellow:
Concerning:
(A) way to use date/time in the training
You may split your datetime in two column, something like:
one column for the weekday:
one column for the time in the day:
If you need a better granularity (here it is a 15 min window) you may also use float number for the time column.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With