ML.NET Build and Train Model using DataTable with Feature columns only known at run-time

Question

I am trying to write a C# wrapper method to make it easier for me to create, train and use an ML.NET Classification model WITHOUT having to hard-code a class containing my predictor variables and target variable. I have looked at all the examples and ML.NET documentation I could find but could not find a complete example from reading data to using the model.

Below is the method I have in mind. Note that the code for the variables "trainingDataView" and "dataProcessPipeline" is incomplete. I have tried various approaches all day, but to no avail: I keep getting an error at the cross-validation stage telling me that my target column was not found.

public static ITransformer CreateClassificationModelExample(MLContext mlContext, DataTable data, List<string> featureColumns, String targetColumn)
        {

            //I am stuck here. Ideally I would like to see a code snippet to create a IDataView from the DataTable passed in as parameter
            //and then selecting only the columns in parameter 'featureColumns' and target = parameter 'targetColumn'
            var trainingDataView = ????; 


            // Data process configuration with pipeline data transformations 
            var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey(targetColumn, targetColumn)
                                      .Append(mlContext.Transforms.Categorical.OneHotEncoding(ValToKeys))
                                      .Append(mlContext.Transforms.Concatenate("Features", featureSet))
                                      .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
                                      .AppendCacheCheckpoint(mlContext);


            // Set the training algorithm 
            var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName: targetColumn, featureColumnName: "Features")
                                     .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

            var trainingPipeline = dataProcessPipeline.Append(trainer);

            // Evaluate quality of Model
            var crossValidationResults = mlContext.MulticlassClassification.CrossValidate(trainingDataView, trainingPipeline, numberOfFolds: 5, labelColumnName: targetColumn);

            // Train Model
            ITransformer model = trainingPipeline.Fit(trainingDataView);

            return model;
        }

I have thoroughly explored the ML.NET documentation, including the LoadFromEnumerable method example. I also looked at the ML.NET blog and cookbook discussions on this topic.

PLEASE if someone can help with a code snippet to make the above method work I am sure that would help many others also! Thanks!

Fritz45 · Accepted Answer

Well, after one more day of effort I got close, though the solution is not yet completely free of compile-time modifications. The code below shows a wrapper that more or less does what I want, although it does require that the NUMBER of model features is known at compile time, which is better but far from ideal.

In the example below, I create an IDataView from a DataTable using only specific columns as predictors/features, and a specific column as the target for the classification model. The code then sets up a classification pipeline (the example uses the "LbfgsMaximumEntropy" trainer), evaluates it using cross-validation and then trains it on the full data. I also show some code on how to create a prediction engine and make a prediction. NOTE THAT this code assumes you have 10 predictor/feature variables. But that 10 is easy to change (2 lines in class "Observation" shown below), which is much easier than writing a new class each time you want to predict from a new data table.

Here is the code. It is a bit old style as I do not use Lambda Expressions:

public static ITransformer CreateClassificationModel(MLContext mlContext, DataTable data, List<string> predictorColumns, String TargetColumn, Dictionary<string, int> TargetMapper)
        {
            //Create instances of the GENERIC class Observation and set the values from the DataTable
            //using only the required predictor columns and the target column
            List<Observation> observations = new List<Observation>();
            foreach (DataRow row in data.Rows)
            {
                var obs = new Observation();

                // Copy the predictor values into the fixed-size feature vector,
                // in the order given by predictorColumns
                int iFeature = 0;
                foreach (string predictorColumn in predictorColumns)
                {
                    obs.Features[iFeature] = Convert.ToSingle(row[predictorColumn]);
                    iFeature++;
                }

                // TargetMapper translates the label text into its integer code
                obs.Target = TargetMapper[row[TargetColumn].ToString()];
                observations.Add(obs);
            }

            var definedSchema = SchemaDefinition.Create(typeof(Observation));

            // Read the data into an IDataView with the modified schema supplied in
            IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(observations, definedSchema);

            var featureSet = new String[1];  
            featureSet[0] = "Features";

            // Data process configuration with pipeline data transformations 
            var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("Target", "Target")
                                      .Append(mlContext.Transforms.Concatenate("Features", featureSet))
                                      .AppendCacheCheckpoint(mlContext);

            // Set the training algorithm 
            var trainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(labelColumnName: "Target", featureColumnName: "Features")
                                      .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
            IEstimator<ITransformer> trainingPipeline = dataProcessPipeline.Append(trainer);


            // Evaluate quality of Model
            var crossValidationResults = mlContext.MulticlassClassification.CrossValidate(trainingDataView, trainingPipeline, numberOfFolds: 5, labelColumnName: "Target");

            // Train Model
            ITransformer model = trainingPipeline.Fit(trainingDataView);


            return model;
        }
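
The method above expects a TargetMapper that translates each label text into an integer, but I did not show how to build one. A minimal sketch, assuming you simply want the codes 0, 1, 2, ... assigned in order of first appearance in the target column (the class and method names here are made up for illustration and are not part of ML.NET):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class TargetMapperSketch
{
    // Hypothetical helper: assigns 0, 1, 2, ... to each distinct value of the
    // target column, in order of first appearance in the table.
    public static Dictionary<string, int> BuildTargetMapper(DataTable data, string targetColumn)
    {
        var mapper = new Dictionary<string, int>();
        foreach (DataRow row in data.Rows)
        {
            // Same conversion as in the wrapper: the cell value as text
            string label = row[targetColumn].ToString();
            if (!mapper.ContainsKey(label))
            {
                // mapper.Count is evaluated before the entry is added,
                // so the first label gets 0, the next new label 1, and so on
                mapper.Add(label, mapper.Count);
            }
        }
        return mapper;
    }
}
```

The result can be passed straight in as the last argument of CreateClassificationModel. Keep the dictionary around after training, because you need the same mapping to interpret predictions.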

To test/use this model, the following PredictionEngine can be used (snippet):

List<Observation> testData = GetTestDataList();  // Get some test data as Observations

// Create a prediction engine from the model for feeding new data.
var engine = mlContext.Model.CreatePredictionEngine<Observation, ModelOutput>(model);

// Make a prediction. The result is of type ModelOutput, class shown below.
var output = engine.Predict(testData[0]);
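
The prediction comes back as the integer code from TargetMapper, not as the original label text. A small sketch of how the mapping could be reversed, written as a plain loop in keeping with the rest of the code (LabelDecoderSketch and Invert are hypothetical names, not ML.NET API):

```csharp
using System;
using System.Collections.Generic;

public static class LabelDecoderSketch
{
    // Hypothetical helper: inverts the label-to-int TargetMapper so that an
    // integer prediction can be reported as the original label text.
    public static Dictionary<int, string> Invert(Dictionary<string, int> targetMapper)
    {
        var inverse = new Dictionary<int, string>();
        foreach (KeyValuePair<string, int> pair in targetMapper)
        {
            // Add throws ArgumentException if two labels share the same code,
            // which would make the mapping ambiguous anyway
            inverse.Add(pair.Value, pair.Key);
        }
        return inverse;
    }
}
```

With that in place, something like inverse[output.Prediction] should give the label text for a prediction, assuming the same TargetMapper was used for training.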

And finally, below are the definitions for the two classes needed in the above code:

public class Observation
    {
        private float[] m_Features = new Single[10];

        [VectorType(10)]
        public float[] Features
        {
            get
            {
                return m_Features;
            }
        }

        public int Target { get; set; }

    }

    public class ModelOutput
    {
        // ColumnName attribute is used to change the column name from
        // its default value, which is the name of the field.
        [ColumnName("PredictedLabel")]
        public Int32 Prediction { get; set; }
        public float[] Score { get; set; }
    }
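
A possible way to drop the hard-coded 10: SchemaDefinition lets you overwrite the column type after creating the schema, so the vector length can be set at run time instead of through the [VectorType(10)] attribute. The following is only a sketch under that assumption; DynamicObservation, RuntimeSchemaSketch, CreateSchema and Load are names I made up for illustration, and I have not run it against the wrapper above:

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical variant of Observation: no [VectorType(10)] attribute, because
// the vector length is supplied through the schema definition at run time.
public class DynamicObservation
{
    public float[] Features { get; set; } = new float[0];
    public int Target { get; set; }
}

public static class RuntimeSchemaSketch
{
    // Builds a schema whose "Features" column is a fixed-size vector of
    // featureCount floats, where featureCount is only known at run time.
    public static SchemaDefinition CreateSchema(int featureCount)
    {
        var schema = SchemaDefinition.Create(typeof(DynamicObservation));
        schema["Features"].ColumnType =
            new VectorDataViewType(NumberDataViewType.Single, featureCount);
        return schema;
    }

    // Loads the rows into an IDataView using the run-time schema.
    public static IDataView Load(MLContext mlContext,
                                 IEnumerable<DynamicObservation> rows,
                                 int featureCount)
    {
        return mlContext.Data.LoadFromEnumerable(rows, CreateSchema(featureCount));
    }
}
```

Each row then needs a Features array of exactly featureCount elements (for example predictorColumns.Count). The prediction engine would presumably need the same schema, which CreatePredictionEngine accepts through its inputSchemaDefinition parameter.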
