I have a question about training an ML.NET model that can predict whether a name is female or not. The model can be trained with a pipeline like this:
var mlContext = new MLContext();
IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(trainingData);
var dataPrepPipeline = mlContext
    .Transforms
    .Text
    .FeaturizeText("FirstNameFeaturized", "FirstName")
    .Append(mlContext.Transforms.Text.FeaturizeText("MiddleNameFeaturized", "MiddleName"))
    .Append(mlContext.Transforms.Text.FeaturizeText("LastNameFeaturized", "LastName"))
    .Append(mlContext.Transforms.Concatenate(
        "Features",
        "FirstNameFeaturized",
        "MiddleNameFeaturized",
        "LastNameFeaturized"))
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    .AppendCacheCheckpoint(mlContext);
var prepPipeline = dataPrepPipeline.Fit(trainingDataView);
var preprocessedData = prepPipeline.Transform(trainingDataView);
var trainer = dataPrepPipeline.Append(mlContext
    .BinaryClassification
    .Trainers
    .AveragedPerceptron(labelColumnName: "IsFemale", numberOfIterations: 10, featureColumnName: "Features"));
ITransformer trainedModel = trainer.Fit(preprocessedData);
I have left out trainingData from the code. The model looks like this:
public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}
I then fetch a list of persons from somewhere (database, CSV, whatever) and convert it to Person objects.
As part of converting the persons to Person, I'm using code that looks like this:
var trainingData = new List<Person>();
trainingData.AddRange(persons.Select(p => new Person
{
    IsFemale = p.IsFemale,
    FirstName = p.FirstName ?? "unknown",
    MiddleName = p.MiddleName ?? "unknown",
    LastName = p.LastName ?? "unknown"
}));
You might be wondering why I insert unknown when one of the name parts is null. I do this because building the ML.NET pipeline fails if any of the properties are null.
So here's my question. I suspect that setting name parts to unknown produces a poor model. Example: if I have a male person with the first name Thomas and none of the other parts, that produces Thomas unknown unknown. Wouldn't that increase the probability of other persons being classified as not female when their middle and last names are missing? Say we have a person named Anna without the remaining parts. This produces Anna unknown unknown, which is close to the other one already marked as non-female.
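Of course it will! You are introducing data into the set that causes most machine learning algorithms to lose precision. The placeholder unknown is featurized like any real name, so every row that contains it ends up with identical middle- and last-name features.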
There are techniques for handling missing data. In this example, however, the features are not numerical, so the most reasonable approach is to leave the records that lack these features out of the training set entirely.
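As a minimal sketch of that idea, the incomplete records can be filtered out before the list is handed to LoadFromEnumerable. The TrainingDataFilter class and its WithCompleteNames method are hypothetical names used only for illustration, and the sketch assumes the source records have already been mapped to Person objects without substituting a placeholder.

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the Person class in the question.
public class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public bool IsFemale { get; set; }
}

public static class TrainingDataFilter
{
    // Hypothetical helper: keeps only the rows where every name part is present,
    // so no placeholder value ever reaches the training set.
    public static List<Person> WithCompleteNames(IEnumerable<Person> persons)
    {
        return persons
            .Where(p => !string.IsNullOrWhiteSpace(p.FirstName)
                     && !string.IsNullOrWhiteSpace(p.MiddleName)
                     && !string.IsNullOrWhiteSpace(p.LastName))
            .ToList();
    }
}
```

The result would then replace the list built in the question, for example `var trainingData = TrainingDataFilter.WithCompleteNames(persons);`. Whether this is acceptable depends on how many records get dropped; if most rows lack a middle name, the remaining set may be too small to train on.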
If these were numerical features of a person, such as weight or height, you could instead compute the mean or mode across the entire data set and use that value in place of the missing one.
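For the numerical case, a rough sketch using ML.NET's ReplaceMissingValues transform with the Mean replacement mode could look like the following. The Measurement class, its Height column, and the ImputationSketch helper are assumptions made up for this example; they are not part of the question's data, and missing heights are assumed to be encoded as float.NaN.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

// Hypothetical numeric record; float.NaN marks a missing height.
public class Measurement
{
    public float Height { get; set; }
}

public static class ImputationSketch
{
    // Hypothetical helper: returns the Height column after mean imputation.
    public static float[] ImputeMeanHeight(MLContext mlContext, IEnumerable<Measurement> rows)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Replace missing (NaN) heights with the column mean learned during Fit.
        var imputer = mlContext.Transforms.ReplaceMissingValues(
            outputColumnName: "Height",
            inputColumnName: "Height",
            replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);

        IDataView imputed = imputer.Fit(data).Transform(data);
        return imputed.GetColumn<float>("Height").ToArray();
    }
}
```

In a real pipeline the transform would more likely be appended alongside the other data preparation steps rather than run on its own; it is isolated here only to keep the example short.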