I'm trying to come up with a parser for football plays. I use the term "natural language" here very loosely so please bear with me as I know little to nothing about this field.
Here are some examples of what I'm working with (Format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):
04:39|4th and 20@NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5@NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10@NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10@NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9@NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|
As of now, I've written a dumb parser that handles all the easy stuff (playID, quarter, time, down&distance, offensive team) along with some scripts that goes and gets this data and sanitizes it into the format seen above. A single line gets turned into a "Play" object to be stored into a database.
The tough part here (for me at least) is parsing the description of the play. Here is some information that I would like to extract from that string:
Example string:
"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."
Result:
turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']
The logic that I had for my initial parser went something like this:
# pass, rush or kick
# gain or loss of yards
# scoring play
# Who scored? off or def?
# TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
# return yards?
# penalty?
# def or off?
# turnover?
# INT, fumble, to on downs?
# off play makers
# def play makers
The descriptions can get pretty hairy (multiple fumbles & recoveries with penalties, etc) and I was wondering if I could take advantage of some NLP modules out there. Chances are I'm going to spend a few days on a dumb/static state-machine like parser instead but if anyone has suggestions on how to approach it using NLP techniques I'd like to hear about them.
I think pyparsing would be very useful here.
Your input text looks very regular (unlike real natural language), and pyparsing is great at this stuff. you should have a look at it.
For example to parse the following sentences:
Mat McBriar punts for 32 yards to NYJ14.
Mark Sanchez rush to the right for 3 yards to the NYJ24.
You would define a parse sentence with something like(look for exact syntax in docs):
name = Group(Word(alphas) + Word(alphas)).setResultsName('name')
action = Or(Exact("punts"),Exact("rush")).setResultsName('action') + Optional(Exact("to the")) + Or(Exact("left"), Exact("right")) )
distance = Word(number).setResultsName("distance") + Exact("yards")
pattern = name + action + Exact("for") + distance + Or(Exact("to"), Exact("to the")) + Word()
And pyparsing would break strings using this pattern. It will also return a dictionary with the items name, action and distance - extracted from the sentence.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With