ML.NET TrainTestSplit random seed

I am using TrainTestSplit in ML.NET to repeatedly split my data set into a training set and a test set. In sklearn, for example, the corresponding function takes a seed as input, so that it is possible to obtain different splits, but in ML.NET repeated calls to TrainTestSplit seem to return the same split. Is it possible to change the random seed used by TrainTestSplit?

Petter T asked Mar 05 '23 01:03


2 Answers

Right now, TrainTestSplit doesn't take a random seed. There is an open issue in ML.NET to fix this: https://github.com/dotnet/machinelearning/issues/1635

As a short-term workaround, I recommend manually adding a random column to the data view, and using it as a stratificationColumn in TrainTestSplit:

data = new GenerateNumberTransform(mlContext, new GenerateNumberTransform.Arguments
{
    Column = new[] { new GenerateNumberTransform.Column { Name = "random" } },
    Seed = 42 // change seed to get a different split
}, data);
(var train, var test) = mlContext.Regression.TrainTestSplit(data, stratificationColumn: "random");

This code works with ML.NET 0.7, and we will fix the seed issue in 0.8.

Zruty answered Mar 14 '23 22:03


As of today (ML.NET v1.0), this has been solved. TrainTestSplit takes a seed as input, and it also supports stratification by setting samplingKeyColumnName:

TrainTestSplit(IDataView data, double testFraction = 0.1, string samplingKeyColumnName = null, Nullable<int> seed = null);
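
For illustration, here is a minimal sketch of repeated splits with different seeds. It assumes ML.NET 1.x (the Microsoft.ML NuGet package), where the method is reached through mlContext.Data. The Row class and the synthetic data are hypothetical, made up only for this example.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type, used only for this example.
public class Row
{
    public float Feature { get; set; }
    public float Label { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Synthetic data: 100 rows with Feature = 0..99.
        var rows = Enumerable.Range(0, 100)
            .Select(i => new Row { Feature = i, Label = 2f * i })
            .ToList();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Each seed should give a different, but reproducible, split.
        foreach (var seed in new[] { 1, 2, 3 })
        {
            var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2, seed: seed);

            // Materialize the test set to inspect which rows landed in it.
            var testRows = mlContext.Data
                .CreateEnumerable<Row>(split.TestSet, reuseRowObject: false)
                .ToList();

            var firstFew = string.Join(", ", testRows.Take(5).Select(r => r.Feature));
            Console.WriteLine($"seed={seed}: {testRows.Count} test rows, first features: {firstFew}");
        }
    }
}
```

The test fraction is approximate, so the number of test rows can vary slightly from one seed to the next.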
Petter T answered Mar 14 '23 22:03