Using placeholder on empty string when training model with ML.NET

Tags:

ml.net

I have a question about training an ML.NET model that predicts whether a name is female. The model can be trained with a pipeline like this:

using Microsoft.ML;

var mlContext = new MLContext();
IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(trainingData);
var dataPrepPipeline = mlContext
    .Transforms
    .Text
    .FeaturizeText("FirstNameFeaturized", "FirstName")
    .Append(mlContext.Transforms.Text.FeaturizeText("MiddleNameFeaturized", "MiddleName"))
    .Append(mlContext.Transforms.Text.FeaturizeText("LastNameFeaturized", "LastName"))
    .Append(mlContext.Transforms.Concatenate(
        "Features",
        "FirstNameFeaturized",
        "MiddleNameFeaturized",
        "LastNameFeaturized"))
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    .AppendCacheCheckpoint(mlContext);

// Append the trainer to the data-prep estimators so that a single Fit call
// featurizes the raw names and trains the classifier.
var trainer = dataPrepPipeline.Append(mlContext
    .BinaryClassification
    .Trainers
    .AveragedPerceptron(labelColumnName: "IsFemale", numberOfIterations: 10, featureColumnName: "Features"));

// Fit on the raw data: the chain already contains the featurization steps,
// so fitting it on pre-transformed data would only run them a second time.
ITransformer trainedModel = trainer.Fit(trainingDataView);

I have left out trainingData from the code. The input class looks like this:

public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}

I then fetch a list of persons from somewhere (database, csv, whatever) and convert it to Person objects.

As part of converting the persons to Person objects, I use code like this:

var trainingData = new List<Person>();
trainingData.AddRange(persons.Select(p => new Person
{
    IsFemale = p.IsFemale,
    FirstName = p.FirstName ?? "unknown",
    MiddleName = p.MiddleName ?? "unknown",
    LastName = p.LastName ?? "unknown"
}));

You might be wondering why I insert unknown when one of the name parts is null. I do this because building the ML.NET pipeline fails if any of the properties is null.

So here's my question. I suspect that setting name parts to unknown produces a poor model. Example: if I have a male person with the first name Thomas and none of the other parts, that produces Thomas unknown unknown. Wouldn't that increase the probability of other persons being classified as not female when their middle and last names are missing? Say we have a person named Anna and none of the remaining parts. This produces Anna unknown unknown, which is close to the record already marked as non-female.

ThomasArdal asked Nov 16 '22 03:11


1 Answer

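Of course it will! You are introducing data into the set that causes most machine learning algorithms to lose precision. The placeholder becomes a feature shared by every record with a missing name part, so the model can learn a spurious link between that placeholder and the label.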

There are techniques for handling missing data. In this example, however, the features are not numerical features of a person, so the most reasonable way to handle records that lack these features is to leave them out entirely when training the model.
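
As a rough sketch of that filtering step, assuming the Person class from the question and a hypothetical in-memory persons list standing in for the real data source, the incomplete records could be dropped with LINQ before the data reaches the pipeline:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample standing in for the rows fetched from the database or CSV.
var persons = new List<Person>
{
    new Person { FirstName = "Anna", MiddleName = "Marie", LastName = "Jensen", IsFemale = true },
    new Person { FirstName = "Thomas", MiddleName = null, LastName = null, IsFemale = false },
    new Person { FirstName = "Anna", MiddleName = "", LastName = "Hansen", IsFemale = true },
};

// Keep only the rows where every name part is present; no placeholder is inserted.
var trainingData = persons
    .Where(p => !string.IsNullOrWhiteSpace(p.FirstName)
             && !string.IsNullOrWhiteSpace(p.MiddleName)
             && !string.IsNullOrWhiteSpace(p.LastName))
    .ToList();

// Only the first record has all three name parts.
Console.WriteLine($"{trainingData.Count} of {persons.Count} rows kept for training");
// prints "1 of 3 rows kept for training"

// Same shape as the class in the question.
public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}
```

Dropping rows this way shrinks the training set, so it is worth checking how many records survive the filter before training on them.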

If these were numerical features of a person, such as weight or height, you could instead fill in each missing value with the mean or mode computed across the entire data set.
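
A minimal sketch of mean imputation, assuming a hypothetical array of heights in which float.NaN marks a missing measurement (neither the data nor the variable names come from the question):

```csharp
using System;
using System.Linq;

// Hypothetical heights in centimetres; float.NaN marks a missing measurement.
float[] heights = { 180f, float.NaN, 160f, 170f, float.NaN };

// Mean of the values that are actually present: (180 + 160 + 170) / 3 = 170.
float mean = heights.Where(h => !float.IsNaN(h)).Average();

// Replace each missing value with that mean; present values are left unchanged.
float[] imputed = heights.Select(h => float.IsNaN(h) ? mean : h).ToArray();

Console.WriteLine(string.Join(", ", imputed));
// prints "180, 170, 160, 170, 170"
```

Inside an ML.NET pipeline, the ReplaceMissingValues transform with its Mean replacement mode is probably the more natural way to do the same thing for float columns.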

Jimenemex answered Nov 29 '22 06:11
