I need to be able to recognise date strings. It doesn't matter if I cannot distinguish between month and day (e.g. 12/12/10); I just need to classify the string as being a date, rather than convert it to a Date object. So this is really a classification problem rather than a parsing problem.
I will have pieces of text such as:
"bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla"
and I need to be able to recognise the start and end boundary for each date string within.
I was wondering if anyone knows of any Java libraries that can do this. My google-fu hasn't come up with anything so far.
UPDATE: I need to be able to recognise the widest possible set of ways of representing dates. Of course the naive solution might be to write an if statement for every conceivable format, but a pattern recognition approach, with a trained model, is ideally what I'm after.
It's a string, which is converted into a date. Internally, a date is stored as a number, not a string.
You just need to format your strings properly ( 'YYYY-MM-DD HH:MI:SS' ) before passing them to the database, and MySQL will happily treat them as dates.
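For the Java side of that, a minimal sketch of producing text in that layout might look like the following. The class and method names are made up for illustration, and it assumes you already hold the value as a java.time LocalDateTime.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class MySqlDateText {
    // Java pattern corresponding to the 'YYYY-MM-DD HH:MI:SS' layout.
    private static final DateTimeFormatter MYSQL_DATETIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats a date-time as text in the layout described above. */
    static String toMySqlText(LocalDateTime value) {
        return value.format(MYSQL_DATETIME);
    }
}
```

Calling MySqlDateText.toMySqlText(LocalDateTime.of(2010, 4, 1, 13, 5, 9)) gives "2010-04-01 13:05:09".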
Use JChronic
You may want to use DateParser2 from edu.mit.broad.genome.utils package.
You can loop all available date formats in Java:
import java.text.DateFormat;
import java.text.ParseException;
import java.util.Locale;

for (Locale locale : DateFormat.getAvailableLocales()) {
    // FULL, LONG, MEDIUM and SHORT are the constants 0 to 3
    for (int style = DateFormat.FULL; style <= DateFormat.SHORT; style++) {
        DateFormat df = DateFormat.getDateInstance(style, locale);
        try {
            df.parse(dateString);
            // either return "true", or return the Date object obtained from parse()
        } catch (ParseException ex) {
            continue; // unparseable, try the next one
        }
    }
}
This however won't account for any custom date formats.
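One caveat with the loop above: DateFormat.parse(String) may stop before the end of the string, so text that merely starts with a date can still parse. A hedged variant is sketched below. It uses the parse(String, ParsePosition) overload and only accepts a string that was consumed completely. The class and method names are my own, and turning off leniency is an extra assumption about what you want.

```java
import java.text.DateFormat;
import java.text.ParsePosition;
import java.util.Date;
import java.util.Locale;

class LocaleDateCheck {
    /** Returns true if some built-in locale/style format consumes the whole string. */
    static boolean looksLikeDate(String text) {
        for (Locale locale : DateFormat.getAvailableLocales()) {
            for (int style = DateFormat.FULL; style <= DateFormat.SHORT; style++) {
                DateFormat df = DateFormat.getDateInstance(style, locale);
                df.setLenient(false); // assumption: reject out-of-range values such as day 32
                ParsePosition pos = new ParsePosition(0);
                Date parsed = df.parse(text, pos); // returns null instead of throwing
                // Require that parsing succeeded and stopped exactly at the end of the input.
                if (parsed != null && pos.getIndex() == text.length()) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

This is still limited to the formats the JDK ships for its locales, so custom formats remain uncovered. It also checks one candidate string at a time, so it does not find boundaries inside a longer text by itself.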
Rules that might help you in your quest:
Month names may appear abbreviated or in full, such as Jan or January. The search must be case-insensitive, because fEBruaRy is also a month, although the person typing it must have been drunk. If you plan to search for non-English months, a database of month names is also needed, because no heuristic will find out that "Wrzesień" is Polish for September.
Numbers with a leading zero, such as 0* where * can be 1-9, are acceptable.
Separators can come from a set such as {-,_, ,:,/,\,.,','}, but it's possible that a separator is a combination of 2 or 3 elements of the mentioned set. Once again, you must choose acceptable separators. 10?20?1999 can be a valid date for somebody with a weird sense of elegance. 10 / 20 / 1999 can also be a valid date, but 10_/20_/1999 would be a very strange one.
I think these are enough for a "naive" classification; a linguistics expert might help you more.
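To make the rules above a bit more concrete, here is a rough sketch of how the month names, leading zeros and separator set could be folded into one case-insensitive regular expression. Everything in it is an assumption for illustration: English month names only, two layouts only (day month year, and three numbers), and hypothetical class and method names. It returns start and end offsets, since the question asks for boundaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveDateSpans {
    // English month names, abbreviated or full (assumption: English text only).
    private static final String MONTH =
            "(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
            + "|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)";
    // One to three characters taken from the chosen separator set.
    private static final String SEP = "[-_ :/\\\\.,']{1,3}";
    // Day or numeric month, one or two digits, leading zero allowed.
    private static final String NUM = "\\d{1,2}";
    // Four- or two-digit year.
    private static final String YEAR = "(?:\\d{4}|\\d{2})";

    private static final Pattern DATE = Pattern.compile(
            "\\b(?:" + NUM + SEP + MONTH + SEP + YEAR       // e.g. 12 Jan 09
            + "|" + NUM + SEP + NUM + SEP + YEAR + ")\\b",  // e.g. 01/04/10
            Pattern.CASE_INSENSITIVE);

    /** Returns {start, end} offsets (end exclusive) of each candidate date. */
    static List<int[]> find(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }
}
```

Being naive, this also accepts things that are not dates, such as 10:20:30 or 99/99/99, so treat its output as candidates to be checked further.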
Now, an idea for your algorithm. Speed doesn't matter. There might be multiple passes over the same string. Optimize when it starts to matter. When you are unsure whether you have found a date string, store it somewhere "safe" in a ListOfPossibleDates and examine it once again with more rigid rules, using combinations of the rules above. When you believe a date string is valid, feed it to the Date class to see if it's really valid. 32nd March 1999 is not valid once you convert it to a format that Date will understand.
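A possible sketch of that final validity check is below. Note that it swaps in java.time's LocalDate with a strict resolver rather than the older Date class, and it assumes the candidate has already been recognised as the single layout "day month-name year". The names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Locale;

class CandidateValidator {
    // Assumed layout: day, full English month name, four-digit year.
    private static final DateTimeFormatter DAY_MONTH_YEAR =
            new DateTimeFormatterBuilder()
                    .parseCaseInsensitive()
                    .appendPattern("d MMMM uuuu")
                    .toFormatter(Locale.ENGLISH)
                    .withResolverStyle(ResolverStyle.STRICT);

    /** Returns true only if the candidate names a day that exists in the calendar. */
    static boolean isRealDate(String candidate) {
        // Drop an ordinal suffix directly after a digit: "32nd" becomes "32".
        String cleaned = candidate.replaceFirst("(?<=\\d)(?:st|nd|rd|th)\\b", "");
        try {
            LocalDate.parse(cleaned, DAY_MONTH_YEAR);
            return true;
        } catch (DateTimeParseException e) {
            return false; // bad layout or impossible date
        }
    }
}
```

With this, isRealDate("32nd March 1999") and isRealDate("31 April 1999") are both false, while isRealDate("12 March 1999") is true. Each additional layout in your ListOfPossibleDates would need its own formatter.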
One important recurring pattern is lookbehind and lookahead. When you believe a valid entity (day, month, year) has been found, you'll have to check what lies before and after it. A stack-based mechanism or recursion might help here.
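If you go the regex route, java.util.regex supports this directly through lookbehind (?<!...) and lookahead (?!...) groups. A small sketch, assuming a slash-separated layout only and with made-up names, that refuses a candidate glued to further digits or slashes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDates {
    private static final Pattern SLASH_DATE = Pattern.compile(
            "(?<![\\d/])"                             // nothing date-like directly before
            + "\\d{1,2}/\\d{1,2}/\\d{2}(?:\\d{2})?"   // d/m/yy or d/m/yyyy
            + "(?![\\d/])");                          // nothing date-like directly after

    /** Returns every slash-separated candidate that stands on its own. */
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SLASH_DATE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

For the text "ref 1201/04/101, due 01/04/10." this returns only 01/04/10, because the first token has extra digits on both sides of anything that looks like a date.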
Steps:
Since there are countless possibilities, you won't be able to catch them all. Once you have found a pattern that you believe could occur again, store it somewhere so you can reuse it as a regex for parsing other strings.
Let's take your example, "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla". After you extract the first date, 12 Jan 09, use the rest of that string ("bla bla bla 01/04/10 bla bla bla") and apply all the above steps once again. This way you'll be sure you didn't miss anything.
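A rough sketch of that extract-then-continue loop, using a small list of stored patterns, could look like this. The two patterns are illustrative assumptions standing in for whatever you have collected so far, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedExtraction {
    // Patterns stored so far (assumed for this example).
    private static final List<Pattern> KNOWN = Arrays.asList(
            Pattern.compile("\\b\\d{1,2} [A-Za-z]{3} \\d{2}\\b"),  // 12 Jan 09
            Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{2}\\b"));    // 01/04/10

    /** Takes the earliest match, then repeats on the rest of the string. */
    static List<String> extractAll(String text) {
        List<String> found = new ArrayList<>();
        String rest = text;
        while (true) {
            int bestStart = -1;
            int bestEnd = -1;
            for (Pattern p : KNOWN) {
                Matcher m = p.matcher(rest);
                // Keep whichever stored pattern matches earliest in the remainder.
                if (m.find() && (bestStart < 0 || m.start() < bestStart)) {
                    bestStart = m.start();
                    bestEnd = m.end();
                }
            }
            if (bestStart < 0) {
                return found; // no stored pattern matches what is left
            }
            found.add(rest.substring(bestStart, bestEnd));
            rest = rest.substring(bestEnd); // continue with the text after the match
        }
    }
}
```

On your example string this yields 12 Jan 09 followed by 01/04/10. If you need offsets into the original text rather than the matched strings, keep a running offset while cutting the remainder.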
I hope these suggestions will be of at least some help. If no library exists that does all these dirty steps (and more) for you, then you have a tough road ahead of you. Good luck!