 

What estimator to use in scikit-learn?

This is my first brush with machine learning, so I'm trying to figure out how this all works. I have a dataset where I've compiled the statistics of every player who has played for my high school baseball team. I also have a list of all the players from my high school who have ever made it to the MLB. What I'd like to do is split the data into a training set and a test set, feed it to some algorithm in the scikit-learn package, and predict the probability of making the MLB.

So I looked through a number of sources and found this cheat sheet, which suggests I start with a linear SVC: SciKit-Learn Cheat Sheet

So, as I understand it, I need to break my data into training samples where each row is a player and each column is a piece of data about the player (batting average, on-base percentage, yada, yada), X_train; and a corresponding target vector with one entry per player that is simply 1 (played in MLB) or 0 (did not play in MLB), y_train. From there, I just call fit(X_train, y_train), and then I can use predict(X_test) to see if it gets the right values for y_test.
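To make the layout concrete, here is a minimal sketch with made-up numbers (100 players, 3 stats, and a fabricated target), assuming scikit-learn and NumPy are available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 3))   # 100 players, 3 stats each (e.g. AVG, OBP, HR)
y = np.zeros(100, dtype=int)
y[:10] = 1                 # pretend 10 of them made the MLB

# stratify=y keeps the 1/0 ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)          # fit(X, y) as described
y_pred = clf.predict(X_test)       # compare y_pred against y_test
print(y_pred.shape)                # (30,) -- one prediction per test player
```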

Does this seem a logical choice of algorithm, method, and application?

EDIT to provide more information: The data is made of 20 features such as number of games played, number of hits, number of home runs, number of strikeouts, etc. Most are basic counting statistics about the player's career; a few are rates, such as batting average.

I have about 10k total rows to work with, so there is enough data to split; but I have no idea how to split it optimally, given that <1% of the players have made the MLB.

Zach asked Oct 29 '22


2 Answers

Alright, here are a few steps you might want to take:

  1. Prepare your data set. In practice, you might want to scale the features, but we'll leave that out to keep the first working model as simple as possible, so we'll just need to split the dataset into a train set and a test set. You could shuffle the records manually and take the first X% of the examples as the train set, but there's already a function for this in the scikit-learn library: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. You'll want to make sure that both positive and negative examples are present in the train and test sets. To do so, you can separate them before the split so that, say, 70% of the negative examples and 70% of the positive examples go to the training set.

  2. Let's pick a simple classifier. I'll use logistic regression here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but other classifiers have a similar API.

  3. Creating the classifier and training it is easy:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    
  4. Now it's time to make our first predictions:

    y_pred = clf.predict(X_test)
    
  5. A very important part of the model is its evaluation. Using accuracy is not a good idea here: the number of positive examples is very small, so the model that unconditionally returns 0 can get a very high score. We can use the f1 score instead: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.

  6. If you want to predict probabilities instead of labels, you can just use the predict_proba method of the classifier.

That's it. We have a working model! Of course, there are a lot of things you may try to improve, such as scaling the features, trying different classifiers, and tuning their hyperparameters, but this should be enough to get started.
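Putting the steps above together, a minimal end-to-end sketch might look like the following (with fabricated data standing in for the real player statistics; 20 features and ~5% positives, to mimic the imbalance described in the question):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((1000, 20))       # 1000 players, 20 features
y = np.zeros(1000, dtype=int)
y[:50] = 1                       # imbalanced: 5% positives

# stratify=y keeps the positive/negative ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("f1:", f1_score(y_test, y_pred))   # more informative than accuracy here

proba = clf.predict_proba(X_test)        # column 1 = P(positive class)
print(proba.shape)                       # (300, 2)
```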

kraskevich answered Nov 15 '22


If you don't have a lot of experience in ML: scikit-learn offers classification algorithms (if the target of your dataset is a boolean or a categorical variable) and regression algorithms (if the target is a continuous variable).

If you have a classification problem and your variables are on very different scales, a good starting point is a decision tree:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

The classifier is a tree, and you can see the decisions that are taken at its nodes.
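For example, a small tree fit on scikit-learn's built-in iris dataset (used here only as a stand-in, since the real data isn't available) can print out those node decisions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# export_text renders the threshold tested at each node
print(export_text(clf))
```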

After that you can use a random forest, which is an ensemble of decision trees that averages their results:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
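The API is the same as for a single tree; again using the built-in iris data as an illustrative stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# training accuracy -- in practice, score on a held-out test set instead
print(clf.score(X, y))
```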

After that you can put every feature on the same scale:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

And then you can use other algorithms, like SVMs.
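Scaling and the SVM can be chained together with a pipeline so the scaler is fit only on the training data; a sketch on the same stand-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# scale each feature to zero mean / unit variance, then classify with an SVM
model = make_pipeline(StandardScaler(), SVC())
model.fit(X, y)
print(model.score(X, y))
```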

For every algorithm you need a technique to select its parameters, for example cross-validation:

https://en.wikipedia.org/wiki/Cross-validation_(statistics)
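scikit-learn wraps this up in GridSearchCV, which cross-validates each candidate parameter value for you; a minimal sketch (the C values tried here are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# try several regularization strengths, scoring each with 5-fold CV
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```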

But a good course is the best option to learn. On Coursera you can find several good courses, like this one:

https://www.coursera.org/learn/machine-learning

Rob answered Nov 15 '22