Predicting Football match winners based only on previous data of same match

I'm a huge football (soccer) fan and also interested in machine learning. As a project for my ML course, I'm trying to build a model that predicts the home team's chance of winning, given the names of the home and away teams. (I query my dataset and create data points from previous matches between those two teams.)

I have data for several seasons for all teams, but I have run into the following issue and would like some advice. The EPL (English Premier League) has 20 teams, which play each other home and away (380 games in total per season). Thus, in each season, any two teams play each other only twice.

I have data for the past 10+ years, which gives roughly 2*10=20 data points for any two teams. However, I do not want to go back more than 3 years, since I believe teams change considerably over time (ManCity, Liverpool) and older data would only introduce more error into the system.
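The arithmetic above can be checked with a short sketch. The team names are placeholders and the helper head_to_head_samples is a hypothetical name used only for illustration; the league size and the two meetings per season come from the question.

```python
from itertools import permutations

TEAMS_IN_LEAGUE = 20      # EPL size stated in the question
MEETINGS_PER_SEASON = 2   # one home and one away game per pair of teams

# Every ordered (home, away) pair is one fixture, so a season has 20 * 19 games.
teams = [f"team_{i}" for i in range(TEAMS_IN_LEAGUE)]  # placeholder names
fixtures_per_season = len(list(permutations(teams, 2)))
print(fixtures_per_season)  # 380


def head_to_head_samples(seasons: int) -> int:
    """Number of past meetings between one specific pair of teams."""
    return seasons * MEETINGS_PER_SEASON


print(head_to_head_samples(10))  # 20
print(head_to_head_samples(3))   # 6
```

So a 3-season window leaves about 6 rows per pair, while the league as a whole produces 380 rows per season.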

So this leaves only around 6-8 data points for each pair of teams. However, I do have several features (up to 20+) for each data point, such as full-time goals, half-time goals, passes, shots, yellow cards, red cards, etc. for both teams, so I can include features like recent form, recent home form, recent away form, etc.

However, training on only 6-8 data points seems wrong to me. Any thoughts on how I could counter this problem (if it is a problem in the first place)?

Thanks!

EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).

asked Mar 20 '13 01:03

keithxm23

3 Answers

I have a similar system. A good source of data is football-data.co.uk. I used the last N seasons for each league to build a model (believe me, more than 3 years is a must!). What you build depends on your objective function: whether the criterion is best fit or maximum profit, you can build your own prediction model around it.

One very useful thing to know is that each league is different. Bookmakers also price a home win for the favourite differently in Belgium than in the fifth tier of English football, where you can sometimes find real value odds.

From that you can build interesting models, for example betting tips that use your patterns to beat the bookmakers on specific matches and find value bets. Or you can try to land as many winning tips as possible, though that will possibly earn less (draws pay well even though fewer draw bets win).
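To make the "value bet" idea concrete, here is a minimal sketch, assuming decimal odds and a model that outputs a win probability. The function names and the example numbers are hypothetical, not taken from any real bookmaker or from the system described above.

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def expected_profit(model_prob: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit: win (odds - 1) * stake with probability p, lose the stake otherwise."""
    return model_prob * (decimal_odds - 1.0) * stake - (1.0 - model_prob) * stake


def is_value_bet(model_prob: float, decimal_odds: float) -> bool:
    """A bet has value when its expected profit under the model is positive."""
    return expected_profit(model_prob, decimal_odds) > 0


# Hypothetical numbers: a draw priced at 3.5 that the model rates at 30%,
# and a home favourite priced at 1.5 that the model rates at 60%.
print(is_value_bet(0.30, 3.5))  # True
print(is_value_bet(0.60, 1.5))  # False
```

In this toy case the draw is the value bet even though it wins less often than the favourite, which is the trade-off between profit and hit rate mentioned above.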

Hopefully I gave you some ideas; feel free to ask for more.

answered Sep 29 '22 10:09

kovomaster

That's an interesting problem, and I don't think it has a unique solution. However, there are a couple of things I would try if I were in your position.

I share your concern that 6-8 points per class is too little data to build a reliable model, so I would model the problem a bit differently. To have more data for each class, instead of having 20 classes I would have only two (home/away) and add two features, one for the home team and one for the away team. In that setup you can still predict which team would win given whether it is playing at home or away, and the model has more data to work with.

Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, this shouldn't add too much noise to your model, and you could benefit from the additional data (assuming those features are valid in other leagues).
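A minimal sketch of the "teams as features" setup, in plain Python. The match records, the dictionary keys, and the helper names build_team_index and encode_match are assumptions made for illustration; the label (1 if the home team won, 0 otherwise) is one possible reading of the two-class setup.

```python
def build_team_index(matches):
    """Map each team name to a column index (sorted for reproducibility)."""
    teams = sorted({m["home"] for m in matches} | {m["away"] for m in matches})
    return {team: i for i, team in enumerate(teams)}


def encode_match(home, away, team_index):
    """One-hot row: the first block marks the home team, the second the away team."""
    n = len(team_index)
    row = [0] * (2 * n)
    row[team_index[home]] = 1
    row[n + team_index[away]] = 1
    return row


# Hypothetical toy results, not real data.
matches = [
    {"home": "Arsenal", "away": "Chelsea", "home_goals": 2, "away_goals": 1},
    {"home": "Chelsea", "away": "Liverpool", "home_goals": 0, "away_goals": 0},
    {"home": "Liverpool", "away": "Arsenal", "home_goals": 1, "away_goals": 3},
]

team_index = build_team_index(matches)
X = [encode_match(m["home"], m["away"], team_index) for m in matches]
y = [1 if m["home_goals"] > m["away_goals"] else 0 for m in matches]

print(X[0])  # [1, 0, 0, 0, 1, 0]
print(y)     # [1, 0, 0]
```

Every match in the league becomes one training row, so a single season contributes 380 rows instead of 2 per pair. X and y can then be passed to whatever classifier you prefer.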

answered Sep 29 '22 12:09

Pedrom

I don't know if this is still helpful, but features like full-time goals, half-time goals, passes, shots, yellows, reds, etc. are features that you don't have for the new match you want to classify, because they are only known once it has been played.

I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that are also available for the new match, e.g. the number of missing players (due to injuries/red cards), the number of consecutive wins/draws/losses each team has had immediately BEFORE the match, which team is at home (already mentioned), goals scored in the last few matches home and away, etc.

Having 6-8 matches is the real problem. This dataset is very small and there would be a lot of over-fitting, but if you use features like the ones I mentioned, I think you could also use older data.
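Here is a rough sketch of how such pre-match features could be computed, assuming the matches are sorted by date. The record layout, the window of 5 games, and the use of league points (3/1/0) as a compact stand-in for the win/draw/loss streak are all my own assumptions for illustration.

```python
from collections import defaultdict, deque


def result_code(home_goals, away_goals):
    """Return '1' (home win), 'X' (draw) or '2' (away win)."""
    if home_goals > away_goals:
        return "1"
    if home_goals < away_goals:
        return "2"
    return "X"


def add_form_features(matches, window=5):
    """Attach pre-match form to each match.

    `matches` must be sorted by date. Features use only games played strictly
    before the current one, so nothing from the match itself leaks in.
    """
    # team -> most recent (points, goals_scored) tuples, at most `window` of them
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for m in matches:
        h, a = m["home"], m["away"]
        label = result_code(m["home_goals"], m["away_goals"])
        rows.append({
            "home": h,
            "away": a,
            "home_points_last_n": sum(p for p, _ in history[h]),
            "away_points_last_n": sum(p for p, _ in history[a]),
            "home_goals_last_n": sum(g for _, g in history[h]),
            "away_goals_last_n": sum(g for _, g in history[a]),
            "label": label,
        })
        # Update the history only after this match's features are recorded.
        home_pts, away_pts = {"1": (3, 0), "X": (1, 1), "2": (0, 3)}[label]
        history[h].append((home_pts, m["home_goals"]))
        history[a].append((away_pts, m["away_goals"]))
    return rows


# Hypothetical toy results in chronological order.
matches = [
    {"home": "A", "away": "B", "home_goals": 2, "away_goals": 0},
    {"home": "B", "away": "C", "home_goals": 1, "away_goals": 1},
    {"home": "C", "away": "A", "home_goals": 0, "away_goals": 1},
]
rows = add_form_features(matches)
print(rows[2]["home_points_last_n"], rows[2]["away_points_last_n"])  # 1 3
```

Because these features describe the teams' state before kick-off rather than the head-to-head history, they can be computed for any upcoming match, and older seasons can be included without assuming the squads are unchanged.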

answered Sep 29 '22 12:09

tomas

Tags:

machine-learning

neural-network

regression

prediction