Natural Language parser for parsing sports play-by-play data

I'm trying to come up with a parser for football plays. I use the term "natural language" here very loosely so please bear with me as I know little to nothing about this field.

Here are some examples of what I'm working with (Format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):

04:39|4th and 20@NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5@NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10@NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10@NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9@NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|

As of now, I've written a dumb parser that handles all the easy stuff (playID, quarter, time, down & distance, offensive team), along with some scripts that fetch this data and sanitize it into the format seen above. Each line gets turned into a "Play" object to be stored in a database.
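To make the "easy stuff" concrete, here is a minimal sketch of splitting one line into its fields. The `Play` dataclass, its field names, and `parse_line` are hypothetical stand-ins for illustration (the real object also carries playID and quarter, which do not appear in the sample lines), and the regex assumes the down-and-distance field always looks like the samples above.

```python
import re
from dataclasses import dataclass

# Hypothetical container; the real "Play" object is not shown here.
@dataclass
class Play:
    time: str
    down: int
    distance: int
    yardline: str
    offense: str
    description: str

# Matches the DOWN&DIST field, e.g. "4th and 20@NYJ46".
DOWN_DIST = re.compile(r"(\d)(?:st|nd|rd|th) and (\d+)@(\w+)")

def parse_line(line: str) -> Play:
    # Drop the trailing "|" and split into exactly four fields.
    time, down_dist, offense, description = line.rstrip("|").split("|", 3)
    m = DOWN_DIST.fullmatch(down_dist)
    if m is None:
        raise ValueError(f"unrecognized down and distance: {down_dist!r}")
    return Play(time, int(m.group(1)), int(m.group(2)), m.group(3),
                offense, description)

play = parse_line(
    "02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to "
    "Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|"
)
print(play.down, play.distance, play.yardline, play.offense)
```

Anything that does not fit the assumed down-and-distance shape raises a `ValueError` rather than being guessed at.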

The tough part here (for me at least) is parsing the description of the play. Here is some information that I would like to extract from that string:

Example string:

"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."

Result:

turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']

The logic that I had for my initial parser went something like this:

# pass, rush or kick
# gain or loss of yards
# scoring play
    # Who scored? off or def?
    # TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
    # return yards?
# penalty?
    # def or off?
# turnover?
    # INT, fumble, to on downs?
# off play makers
# def play makers

The descriptions can get pretty hairy (multiple fumbles and recoveries with penalties, etc.), and I was wondering if I could take advantage of some NLP modules out there. Chances are I'm going to spend a few days on a dumb/static state-machine-like parser instead, but if anyone has suggestions on how to approach it using NLP techniques, I'd like to hear about them.

Jon asked Nov 20 '11 02:11

1 Answer

I think pyparsing would be very useful here.

Your input text looks very regular (unlike real natural language), and pyparsing is well suited to that kind of input. You should have a look at it.

For example, to parse the following sentences:

Mat McBriar punts for 32 yards to NYJ14.
Mark Sanchez rush to the right for 3 yards to the NYJ24.

You would define a grammar with something like this (check the pyparsing docs for the exact syntax):

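from pyparsing import Group, Keyword, Optional, Word, alphanums, alphas, nums, oneOf

# Two-word player name, e.g. "Mat McBriar"
name = Group(Word(alphas) + Word(alphas)).setResultsName('name')

# "punts" or "rush", optionally followed by "to the left" / "to the right"
action = oneOf("punts rush").setResultsName('action') + Optional(Keyword("to") + Keyword("the") + oneOf("left right"))

distance = Word(nums).setResultsName('distance') + oneOf("yard yards")

# Ends with the spot of the ball, e.g. "to NYJ14" or "to the NYJ24"
pattern = name + action + Keyword("for") + distance + Keyword("to") + Optional(Keyword("the")) + Word(alphanums)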

pyparsing will then break up any string that matches this pattern. Calling pattern.parseString(sentence) returns a ParseResults object from which the named items (name, action and distance) can be read like dictionary entries.
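As a rough, runnable sketch of how the same idea might stretch to the pass and tackle phrases in the question, the grammar below handles the "Mark Sanchez pass to the left to Shonn Greene ..." example. The names `player`, `play`, `actor`, `receiver`, `tackler` and `spot` are my own, and it assumes every player name is exactly two alphabetic words, which will not hold for all real names. It also ignores fumbles, penalties and returns.

```python
from pyparsing import (Group, Keyword, Optional, Suppress, Word,
                       alphanums, alphas, nums, oneOf)

# Assumption: a player name is exactly two alphabetic words.
player = Group(Word(alphas) + Word(alphas))

# "to the left", "to the right" or "up the middle"
direction = (
    Keyword("to") + Keyword("the") + oneOf("left right").setResultsName("direction")
    | Keyword("up") + Keyword("the") + Keyword("middle").setResultsName("direction")
)

play = (
    player.setResultsName("actor")
    + oneOf("pass rush").setResultsName("action")
    + Optional(direction)
    + Optional(Keyword("to") + player.setResultsName("receiver"))
    + Keyword("for") + Word(nums).setResultsName("yards") + oneOf("yard yards")
    + Keyword("to") + Optional(Keyword("the"))
    + Word(alphanums).setResultsName("spot")
    + Optional(Suppress(".") + Keyword("Tackled") + Keyword("by")
               + player.setResultsName("tackler"))
)

result = play.parseString(
    "Mark Sanchez pass to the left to Shonn Greene for 7 yards "
    "to the NYJ44. Tackled by Mike Jenkins."
)

# Collect whichever of the named players were present in this play.
playmakers = [" ".join(result[key])
              for key in ("actor", "receiver", "tackler") if key in result]
print(result["action"], result["direction"], result["yards"], playmakers)
```

Optional parts that do not occur (for example the receiver on a rush) are simply absent from the result, so checking `key in result` before reading them avoids a KeyError.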

Alex Brooks answered Nov 19 '22 22:11
