
Algorithm for matching 'noisy' names

I have an application which scrapes soccer results from different sources on the web. Team names are not consistent across websites: e.g. Manchester United might be called 'Man Utd' on one site, 'Man United' on a second, and 'Manchester United FC' on a third. I need to map all possible variants back to a single name ('Manchester United'), and repeat the process for each of the 20 teams in the league (Arsenal, Liverpool, Man City etc). Obviously I don't want any bad matches [e.g. 'Man City' being mapped to 'Manchester United'].

Right now I specify regexes for all the possible combinations, e.g. 'Manchester United' would be 'man(chester)?(u|(utd)|(united))(fc)?'; this is fine for a couple of sites but is getting increasingly unwieldy. I'm looking for a solution that avoids having to specify these regexes. E.g. there must be a way to 'score' 'Man Utd' so that it gets a high score against 'Manchester United' but a low or zero score against 'Liverpool'; I'd test the sample text against all possible team names and pick the one with the highest score.
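To make the scoring idea concrete, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. This is only one possible similarity measure, not something either answer below proposes; the `TEAMS` list, the function names, and the `min_score` cut-off of 0.5 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical canonical names; the real list would hold all 20 teams.
TEAMS = ["Manchester United", "Manchester City", "Liverpool", "Arsenal"]


def score(raw, canonical):
    """Return a similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, raw.lower(), canonical.lower()).ratio()


def best_match(raw, candidates=TEAMS, min_score=0.5):
    """Return the highest-scoring candidate, or None if nothing reaches min_score."""
    best = max(candidates, key=lambda name: score(raw, name))
    return best if score(raw, best) >= min_score else None


print(best_match("Man Utd"))
print(best_match("Man City"))
```

Character-level scores like this can be fairly close for short abbreviations (both Manchester clubs share the 'man' prefix), so the cut-off would need tuning against real data, and names that score near the threshold are probably best reviewed by hand.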

My sense is that the solution may be similar to the classic example of a neural net being trained to recognise handwriting [i.e. there is a fixed set of possible outcomes, and a degree of noise in the input samples].

Does anyone have any ideas?

Thanks.

Justin asked Aug 02 '10 11:08


2 Answers

It appears that you're screen scraping the same sources.

Assuming your sources are consistent in naming the teams, a string conversion would be the most effective solution.

Man Utd -> Manchester United

Manchester United FC -> Manchester United

Gilbert Le Blanc answered Oct 05 '22 07:10


I've solved this exact problem in Python but without any sophisticated AI. I just have a text file that maps the different variations to the canonical form of the name. There aren't that many variations and once you've enumerated them all they will rarely change.

My file looks something like this:

man city=Manchester City
man united=Manchester United
man utd=Manchester United
manchester c=Manchester City
manchester utd=Manchester United

I load these aliases into a dictionary object and then when I have a name to map, I convert it to lowercase (to avoid any problems with differing capitalisation) and then look it up in the dictionary.
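A minimal sketch of that load-and-lookup step might look like the following. The alias text is inlined here only so the example is self-contained, and the names `load_aliases` and `canonical_name` are hypothetical rather than taken from my actual code.

```python
# Hypothetical contents of the alias file, inlined so the sketch is self-contained.
ALIAS_TEXT = """\
man city=Manchester City
man united=Manchester United
man utd=Manchester United
manchester c=Manchester City
manchester utd=Manchester United
"""


def load_aliases(lines):
    """Build a dict mapping lowercased alias -> canonical name."""
    aliases = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        alias, canonical = line.split("=", 1)
        aliases[alias.strip().lower()] = canonical.strip()
    return aliases


def canonical_name(raw, aliases):
    """Look up a scraped name; raises KeyError for an unknown variation."""
    return aliases[raw.strip().lower()]


ALIASES = load_aliases(ALIAS_TEXT.splitlines())
print(canonical_name("Man Utd", ALIASES))  # Manchester United
```

Letting an unknown name raise `KeyError` is a deliberate choice in this sketch: a variation that is not in the file fails loudly and can be added, rather than being silently mapped to the wrong team.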

If you know how many teams there are supposed to be, you can also add a check to warn you if you find more distinct names than you are expecting.
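That check could be sketched roughly as follows, assuming the names have already been mapped to their canonical form. The function name `check_team_count` and the default of 20 teams are assumptions for illustration.

```python
import warnings


def check_team_count(canonical_names, expected=20):
    """Warn if more distinct names turn up than the league has teams.

    Returns the set of distinct names so the caller can inspect it.
    """
    distinct = set(canonical_names)
    if len(distinct) > expected:
        warnings.warn(
            f"found {len(distinct)} distinct team names, expected at most {expected}"
        )
    return distinct


# Two distinct names against a limit of two: no warning is issued.
check_team_count(["Arsenal", "Liverpool", "Arsenal"], expected=2)
```

A surplus name usually means a new variation slipped through unmapped, so the returned set is a convenient place to look for the entry that needs adding to the alias file.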

Dan Dyer answered Oct 05 '22 07:10