 

Recognise an arbitrary date string [closed]

I need to be able to recognise date strings. It doesn't matter if I cannot distinguish between month and day (e.g. 12/12/10); I just need to classify the string as being a date, rather than convert it to a Date object. So this is really a classification problem rather than a parsing problem.

I will have pieces of text such as:

"bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla"

and I need to be able to recognise the start and end boundaries of each date string within it.

I was wondering if anyone knew of any Java libraries that can do this. My Google-fu hasn't come up with anything so far.

UPDATE: I need to be able to recognise the widest possible set of ways of representing dates. Of course, the naive solution would be to write an if statement for every conceivable format, but a pattern recognition approach, with a trained model, is ideally what I'm after.

Joel asked Oct 03 '10 17:10



3 Answers

Use JChronic

You may want to use DateParser2 from the edu.mit.broad.genome.utils package.

Puspendu Banerjee answered Oct 18 '22 15:10


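You can loop over all the date formats that Java provides:

import java.text.DateFormat;
import java.text.ParseException;
import java.util.Locale;

static boolean isDate(String dateString) {
    for (Locale locale : DateFormat.getAvailableLocales()) {
        // The style constants run from FULL (0) to SHORT (3).
        for (int style = DateFormat.FULL; style <= DateFormat.SHORT; style++) {
            DateFormat df = DateFormat.getDateInstance(style, locale);
            try {
                // parse(String) reads from the start and may ignore trailing text
                df.parse(dateString);
                return true; // or return the parsed Date object instead
            } catch (ParseException ex) {
                // unparseable with this format, try the next one
            }
        }
    }
    return false;
}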

This, however, won't account for any custom date formats.
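
One possible way to cover custom formats is to try a hand-picked list of SimpleDateFormat patterns as well. The sketch below is only an illustration: the class name, the method name and the pattern list are assumptions, and you would extend the list with whatever your data contains. It turns lenient parsing off and uses a ParsePosition to require that the whole string was consumed.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Locale;

class CustomFormatCheck {
    // Hypothetical pattern list; not exhaustive, extend as needed.
    static final String[] PATTERNS = {
        "dd MMM yy", "dd/MM/yy", "dd-MM-yyyy", "yyyy-MM-dd", "ddMMMyyyy"
    };

    /** Returns true if the whole string parses strictly with one of PATTERNS. */
    static boolean isDate(String s) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.ENGLISH);
            df.setLenient(false); // reject out-of-range fields such as day 32
            ParsePosition pos = new ParsePosition(0);
            // parse returns null on failure; also require that nothing is left over
            if (df.parse(s, pos) != null && pos.getIndex() == s.length()) {
                return true;
            }
        }
        return false;
    }
}
```

With this list, CustomFormatCheck.isDate("12 Jan 09") and CustomFormatCheck.isDate("01/04/10") return true, while "32/01/10" is rejected.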

Bozho answered Oct 18 '22 15:10


Rules that might help you in your quest:

  1. Build or find a list of known month names, both abbreviated and full, such as Jan or January. The search must be case-insensitive, because fEBruaRy is also a month, although the person typing it must have been drunk. If you plan to match non-English months, you need such a list for each language too, because no heuristic will work out that "Wrzesień" is Polish for September.
  2. For English only, handle ordinal numbers and keep a list of the numbers 1 to 31. These will be useful for days and months. If you want to use this approach for other languages, you will have to do your own research.
  3. Once again, English only: check for "Anno Domini" and "Before Christ", that is, AD and BC respectively. They can also appear as A.D. and B.C.
  4. For the numbers that represent days, months and years, you must know where your limit is. Is it 0-9999, or more? That is, do you want to search for dates with years beyond 9999? If not, then strings of 1-4 consecutive digits are good guesses for a valid day, month or year.
  5. Days and months have one or two digits. Leading zeros are acceptable, so strings of the form 0*, where * is 1-9, are acceptable.
  6. Separators can be tricky, but if you don't allow inconsistent formatting like 10/20\1999, you will save yourself a lot of grief. This is because 10*20*1999 can be a valid date, with * usually being one element of set {-,_, ,:,/,\,.,','}, but it's possible that * is a combination of 2 or 3 elements of that set. Once again, you must choose the acceptable separators. 10?20?1999 can be a valid date for somebody with a weird sense of elegance. 10 / 20 / 1999 can also be a valid date, but 10_/20_/1999 would be a very strange one.
  7. There are cases with no separator, for example 10Jan1988. These cases use the words from rule 1.
  8. There are special cases, such as February having 28 or 29 days depending on the leap year, and months having 30 or 31 days.
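
Several of these rules can be expressed as a single regular expression. The following is a rough sketch, not a complete solution: the class and constant names are made up for illustration, the month list is English only, and the separator set is just one possible choice. It covers rules 1, 4, 5, 6 and 7 (day-first and numeric forms only) and reports the start and end offsets of each candidate. Candidates still need validating, since something like 10:20:30 also fits the shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateCandidateFinder {
    // Rule 1: month names, full and abbreviated (English only in this sketch).
    static final String MONTH =
        "(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
        + "|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)";
    // Rules 4 and 5: one or two digits for a day or month, one to four for a year.
    static final String DAY_OR_MONTH = "\\d{1,2}";
    static final String YEAR = "\\d{1,4}";
    // Rule 6: an assumed separator set, captured as group 1 so that the
    // backreference \1 forces the second separator to equal the first.
    static final String SEP = "([-_ :/\\\\.,'])";

    static final Pattern CANDIDATE = Pattern.compile(
        "(?<!\\d)(?:"
        // 01/04/10 or 12 Jan 09, with consistent separators
        + DAY_OR_MONTH + SEP + "(?:" + DAY_OR_MONTH + "|" + MONTH + ")\\1" + YEAR
        // rule 7: no separator at all, e.g. 10Jan1988
        + "|" + DAY_OR_MONTH + MONTH + YEAR
        + ")(?!\\d)",
        Pattern.CASE_INSENSITIVE);

    /** Returns {start, end} offsets (end exclusive) of every candidate in the text. */
    static List<int[]> find(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }
}
```

For the example text from the question, find returns two spans, one for 12 Jan 09 and one for 01/04/10. The inconsistent 10/20\1999 produces no candidate because of the backreference.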

I think these are enough for a "naive" classification; a linguistics expert might help you more.

Now, an idea for your algorithm. Speed doesn't matter at first: there might be multiple passes over the same string, and you can optimize when it starts to matter. When you suspect that you have found a date string, store it somewhere "safe" in a ListOfPossibleDates and examine it again with stricter rules, using combinations of rules 1 to 8. When you believe a date string is valid, convert it to a format that the Date class understands and feed it in to see if it's really valid; 32nd March 1999, for example, is not. Keep in mind that Java's DateFormat parses leniently by default, so call setLenient(false) first, or 32 March will silently roll over into April.
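
As an illustration of that final validity check, here is a minimal sketch that uses java.time.LocalDate (Java 8 and later) as a stand-in for the Date class mentioned above. That substitution, and the class and method names, are assumptions of this example. LocalDate.of throws a DateTimeException for impossible dates, which also takes care of rule 8.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

class DateValidator {
    /** Returns true if year-month-day names a day that exists in the ISO calendar. */
    static boolean isRealDate(int year, int month, int day) {
        try {
            LocalDate.of(year, month, day); // throws if any field is out of range
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }
}
```

Here isRealDate(1999, 3, 32) and isRealDate(1999, 2, 29) return false, while isRealDate(2000, 2, 29) returns true.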

One important recurring pattern is looking behind and looking ahead. When you believe you have found a valid entity (day, month, year), you'll have to see what lies before and after it. A stack-based mechanism or recursion might help here.

Steps:

  1. Search your string for words from rule 1. If you find one, note its location and note the month. Now go a few characters back and a few ahead to see what awaits you. If there are no spaces before and after your month, and there are numbers, as in rule 7, check them for validity. If one of them represents a day (must be 1-31) and the other a year (must be 0-9999, possibly with AD or BC), you have one candidate. If the same separators appear before and after, apply rule 6. Always remember that you must be sure that a valid combination exists, so 32Jan1999 won't do.
  2. Search your string for the other English words, from rules 2 and 3, and proceed as in step 1.
  3. Search for separators. Empty space will be the trickiest. Try to find them in pairs. So, if you have one "/" in your string, find another one and see what lies in between. If you find a combination of separators, do the same thing. Also, use the algorithm from step 2.
  4. Search for digits. Valid ones are 0-9999 with leading zeroes allowed. If you find one, look for separators as in step 3.
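
Step 1 might look roughly like the sketch below. It is a simplification under stated assumptions: the names are invented, only day-month-year order is handled, at most one separator character is skipped on each side, and it does not check that the two separators are the same. It anchors on a month word, looks behind for up to two digits and ahead for up to four, then rejects impossible days.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonthAnchoredSearch {
    // Month words from rule 1 (English only), not embedded in a longer word.
    static final Pattern MONTH = Pattern.compile(
        "(?<![a-z])(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
        + "|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)(?![a-z])",
        Pattern.CASE_INSENSITIVE);
    // Assumed separator set; choose your own, as rule 6 suggests.
    static final String SEPARATORS = "-_ :/\\.,'";

    /** Returns the first "day month year" candidate built around a month word, or null. */
    static String firstCandidate(String text) {
        Matcher m = MONTH.matcher(text);
        while (m.find()) {
            // Look behind: skip at most one separator, then collect up to two digits.
            int dayEnd = m.start();
            if (dayEnd > 0 && SEPARATORS.indexOf(text.charAt(dayEnd - 1)) >= 0) dayEnd--;
            int dayStart = dayEnd;
            while (dayStart > 0 && dayEnd - dayStart < 2
                    && Character.isDigit(text.charAt(dayStart - 1))) dayStart--;
            // Look ahead: skip at most one separator, then collect up to four digits.
            int yearStart = m.end();
            if (yearStart < text.length()
                    && SEPARATORS.indexOf(text.charAt(yearStart)) >= 0) yearStart++;
            int yearEnd = yearStart;
            while (yearEnd < text.length() && yearEnd - yearStart < 4
                    && Character.isDigit(text.charAt(yearEnd))) yearEnd++;
            if (dayStart == dayEnd || yearStart == yearEnd) continue; // digits missing on one side
            int day = Integer.parseInt(text.substring(dayStart, dayEnd));
            if (day < 1 || day > 31) continue; // e.g. 32Jan1999 won't do
            return text.substring(dayStart, yearEnd);
        }
        return null;
    }
}
```

For "bla bla 12 Jan 09 bla" this returns 12 Jan 09, and for "32Jan1999" it returns null.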

Since there are countless possibilities, you won't be able to catch them all. Once you have found a pattern that you believe could occur again, store it somewhere so you can use it as a regex for parsing other strings.
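
One simple way to turn a confirmed date string into a reusable regex is to generalise its shape: every digit becomes \d, every letter becomes [A-Za-z], and everything else is kept literally. This is only a sketch of the idea with made-up names; a real implementation would probably want looser rules, such as variable-length digit runs.

```java
import java.util.regex.Pattern;

class PatternLearner {
    /** Turns a confirmed date string into a regex matching strings of the same shape. */
    static Pattern generalise(String confirmedDate) {
        StringBuilder regex = new StringBuilder();
        for (char c : confirmedDate.toCharArray()) {
            if (Character.isDigit(c)) {
                regex.append("\\d");
            } else if (Character.isLetter(c)) {
                regex.append("[A-Za-z]");
            } else {
                // separators and other characters must match exactly
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }
}
```

A pattern learned from 12 Jan 09 then matches 03 Feb 10 but not 03/02/10.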

Let's take your example, "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla". After you extract the first date, 12 Jan 09, take the rest of that string ("bla bla bla 01/04/10 bla bla bla") and apply all the steps above once again. This way you'll be sure you didn't miss anything.

I hope these suggestions will be of at least some help. If no library exists that does all these dirty steps (and more) for you, then you have a tough road ahead of you. Good luck!

darioo answered Oct 18 '22 14:10