Has anyone found a simple but effective way to extract date references from text? I've done a fair amount of searching for temporal extraction tools, but there isn't a lot out there. There are a few white papers, but temporal extraction seems to be treated as a small corner of the whole semantic web thing and hasn't been given much attention.
I'm just looking for something that is 80% effective. There is no need to capture things like "the month after Jan 2009", but catching basic, common date entities would be nice.
I'm open to all suggestions, even fancy regexes.
Fire away!
(and thanks - Henry)
If the temporal expressions in your data come in only a limited number of formats, use regular expressions and refine them iteratively against your data.
Otherwise, use SUTime from the Stanford NLP toolkit. It might be overkill, but it will definitely meet your demands.
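For the limited-format case, a minimal sketch of the regex approach might look like the following. The two patterns and the function name find_date_strings are only illustrative assumptions, not something from a library; the idea is that you run it over your data, see what it misses, and keep adding patterns.

```python
import re

# Hypothetical starter patterns. Extend this list each time you find
# a date format in your data that is not being picked up.
PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),      # e.g. 2009-01-15
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # e.g. 1/15/2009
]

def find_date_strings(text):
    """Return (start_offset, matched_text) pairs for every pattern hit, in text order."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group()))
    return sorted(hits)
```

This only finds candidate strings. It does not check that they are real calendar dates, so something like 2009-99-99 would still match.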
One way I have done this is to look for any run of exactly four digits and convert it to a number. If the number falls within the range of years you are interested in, you probably have a year you can use. If you are also interested in months and days, you can check the adjacent words to see whether they are a month name or a number between 1 and 31. I am confident this would satisfy your 80% requirement.
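Regex for years: \b[0-9]{4}\b - the word boundaries stop it from matching four digits inside a longer number. You will still need to convert the match to a number and see if it's within the range of years you consider valid.
Regex for months: january|jan|february|feb ... etc for each month. Match case-insensitively, and put the full name before the abbreviation so that "jan" doesn't match first inside "january".
Regex for days of the month: \b[0-9]{1,2}\b - you would need to convert the match to a number and see if it is 1-31.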
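Putting those pieces together, a rough sketch could look like this. The function name extract_dates, the default year range of 1900 to 2100, and the choice to look only at the two words before the year are all assumptions made for illustration; adjust them to your data.

```python
import re

FULL_MONTHS = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]

# Map both the full name and the three-letter abbreviation to a month number.
MONTH_LOOKUP = {}
for number, name in enumerate(FULL_MONTHS, start=1):
    MONTH_LOOKUP[name] = number
    MONTH_LOOKUP[name[:3]] = number

def extract_dates(text, min_year=1900, max_year=2100):
    """Return a list of (year, month, day) tuples; month and day may be None."""
    # Split into runs of letters and runs of digits, dropping punctuation.
    tokens = re.findall(r"[A-Za-z]+|[0-9]+", text)
    results = []
    for i, token in enumerate(tokens):
        # A year candidate is exactly four digits within the accepted range.
        if not re.fullmatch(r"[0-9]{4}", token):
            continue
        year = int(token)
        if not (min_year <= year <= max_year):
            continue
        month = day = None
        # Assumption: only the two tokens just before the year are checked.
        for neighbor in tokens[max(0, i - 2):i]:
            lowered = neighbor.lower()
            if lowered in MONTH_LOOKUP:
                month = MONTH_LOOKUP[lowered]
            elif neighbor.isdigit() and 1 <= int(neighbor) <= 31:
                day = int(neighbor)
        if month is None:
            day = None  # a day number without a month is not treated as a date
        results.append((year, month, day))
    return results
```

For example, extract_dates("Met on Jan 5, 2009 and again in March 2010.") picks up both years with their months, and only the first one gets a day. Dates written with the year first, or with the month after the year, would need the neighbor window widened.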