Probabilistic String Matching in Python

I'm in the process of writing a bot that places bets on the website Betfair using their Python API. I want to place bets on football (soccer) matches when they are in-play.

I've coded up an XML feed that gives me live data from the games. However, the XML feed doesn't always use the same names for football teams as Betfair does.

For example, when referring to Manchester United, Betfair might use "Man Utd", whilst the XML feed might use "Man United" or some other variant. I am not limited to popular markets, so building up a standard Betfair-to-XML name conversion table isn't feasible.

I'm trying to use some kind of probabilistic string matching to give me some indication that the two data sources are referring to the same teams.

So far I've played with Reverend, which seems to do some Bayesian calculations. However, I don't think I'm using it properly, as I have to break the string down into characters to train the guesser. I then simply average the probability that each letter is associated with each name. I'm aware this is mathematically incorrect, but I thought it could be a feasible heuristic test.

Here is my code:

import scorefeed
from reverend.thomas import Bayes

guesser = Bayes()
teams = ['home', 'away']


def train(team_no, name):
    # Train the guesser on each character of the team name.
    for char in name:
        guesser.train(teams[team_no], char)


def untrain(team_no, name):
    # Undo train() so the next fixture starts from a clean guesser.
    for char in name:
        guesser.untrain(teams[team_no], char)


def guess(name):
    # Average the per-character probabilities for each pool (a heuristic only).
    home_guess = 0.0
    away_guess = 0.0

    for char in name:
        # guesser.guess() is called once per character; it yields (pool, probability) pairs.
        for pool, prob in guesser.guess(char):
            if pool == teams[0]:
                home_guess = home_guess + prob
                print(home_guess)  # debug: running total for 'home'
            if pool == teams[1]:
                away_guess = away_guess + prob
                print(away_guess)  # debug: running total for 'away'

    home_guess = home_guess / float(len(name))
    away_guess = away_guess / float(len(name))

    return [home_guess, away_guess]


def game_match(betfair_game_string, feed_home, feed_away):
    # Split "Home V Away" on the separator itself, not on the first 'V' found,
    # so that team names containing a capital V are not cut in the wrong place.
    home_team, away_team = betfair_game_string.split(' V ', 1)

    train(0, home_team)
    train(1, away_team)

    probs = []
    probs.append(guess(feed_home)[0])
    probs.append(guess(feed_away)[1])

    untrain(0, home_team)
    untrain(1, away_team)

    return probs


print(game_match("Man Utd V Lpool", "Manchester United", "Liverpool"))

The probability produced with the current setup is [0.4705411764705883, 0.5555]. I would be really grateful for any ideas or improvements.

EDIT: I've had another thought. I want the probability that it is the same match on Betfair and the feed, but this gives me the probability that the first name matches and the probability that the second name matches. I need to find the probability that the first AND second names match. I have therefore coded up the following function, which seems to give me more reasonable results:

def prob_match(probs):

    prob_not_home = 1.0 - probs[0]
    prob_not_away = 1.0 - probs[1]

    prob_not_home_and_away = prob_not_home*prob_not_away
    prob_home_and_away = 1.0 - prob_not_home_and_away

    return prob_home_and_away

I would still appreciate any suggestions for different methods or recommendations of existing libraries that do the same thing, or tips on correcting my probability calculations.
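
A note on the probability arithmetic: prob_match above returns 1 - (1 - p_home)(1 - p_away), which is the probability that at least one of the two names matches (home OR away), not that both do. If the two scores are treated as independent probabilities (an assumption, since the averaged per-character values are only a heuristic), the AND case is just their product. A minimal sketch, with hypothetical function names:

```python
def prob_both_match(probs):
    """P(home AND away), assuming the two scores are independent."""
    return probs[0] * probs[1]


def prob_either_match(probs):
    """P(home OR away); this is what 1 - (1 - p_h) * (1 - p_a) computes."""
    return 1.0 - (1.0 - probs[0]) * (1.0 - probs[1])


probs = [0.5, 0.5]
print(prob_both_match(probs))    # 0.25
print(prob_either_match(probs))  # 0.75
```

The OR value is never smaller than the AND value, which may be why the original function appears to give "more reasonable" (higher) results.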

James asked Jun 28 '13 15:06


1 Answer

Here is my advice. Read http://norvig.com/spell-correct.html, implement something based on that, and see how well it works. Hopefully it will work well enough.
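
Norvig's corrector looks for known words within a small edit distance of the input and picks the most likely one. For team names, the "known words" could be the names in the Betfair market, with a similarity score standing in for edit distance. The sketch below is not Norvig's algorithm: it swaps in the standard library's difflib.SequenceMatcher, and best_match and the sample names are hypothetical.

```python
import difflib


def similarity(a, b):
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(feed_name, betfair_names):
    """Return (name, score) for the Betfair name most similar to feed_name."""
    scored = [(similarity(feed_name, name), name) for name in betfair_names]
    score, name = max(scored)
    return name, score


name, score = best_match("Man United", ["Man Utd", "Liverpool", "Arsenal"])
print(name, round(score, 2))
```

Heavy abbreviations such as "Lpool" will probably score lower than near-identical spellings, so any acceptance threshold would need tuning against real data.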

Speed it up by caching results on the fly, so that once it has figured out a guess for a given name it simply reuses that guess.
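
One way the caching could look, assuming Python 3 (on Python 2 a plain dict keyed by name would do the same job). The matcher inside is a stand-in based on difflib, and cached_best_match is a hypothetical name:

```python
import difflib
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_best_match(feed_name, betfair_names):
    """Return the most similar Betfair name; results are memoised.

    betfair_names must be a tuple so that the arguments are hashable.
    """
    def score(name):
        return difflib.SequenceMatcher(
            None, feed_name.lower(), name.lower()).ratio()
    return max(betfair_names, key=score)


names = ("Man Utd", "Liverpool", "Arsenal")
first = cached_best_match("Man United", names)   # computed
second = cached_best_match("Man United", names)  # served from the cache
print(first, cached_best_match.cache_info().hits)
```

The cache is keyed on both the feed name and the tuple of Betfair names, so a guess is only reused while the candidate list stays the same.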

Your implementation should have an exception report of the most dubious guesses used, so that you can manually review and either reject or fix them.
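
A sketch of what such an exception report might look like: anything scoring below a cut-off goes into a list for manual review, weakest guess first. The 0.75 threshold, the difflib-based score and the name match_report are all assumptions for illustration.

```python
import difflib


def match_report(feed_names, betfair_names, threshold=0.75):
    """Return (accepted, dubious); dubious is sorted weakest guess first."""
    accepted, dubious = {}, []
    for feed_name in feed_names:
        scored = [
            (difflib.SequenceMatcher(
                None, feed_name.lower(), b.lower()).ratio(), b)
            for b in betfair_names
        ]
        score, guess = max(scored)
        if score >= threshold:
            accepted[feed_name] = guess
        else:
            # Keep the guess so a human can confirm, fix or reject it.
            dubious.append((score, feed_name, guess))
    dubious.sort()
    return accepted, dubious


accepted, dubious = match_report(
    ["Man United", "Liverpool FC", "Spurs"],
    ["Man Utd", "Liverpool", "Tottenham"],
)
for score, feed_name, guess in dubious:
    print("%.2f  %s -> %s" % (score, feed_name, guess))
```

Nicknames such as "Spurs" share almost no characters with "Tottenham", which is exactly the kind of case the report is meant to surface.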

btilly answered Sep 19 '22 22:09