We're using Lucene to build a free-text search box over data delivered to a user, much like an email inbox. We'd like the box to handle dates, for instance 5/1/2011. To keep things simple, the current version of the feature supports just two date formats:
mm/dd/yy
mm/dd/yyyy
For our prototype, we hacked the query analysis process to pre-process the query string and look for these two date patterns. That was about two years ago, when we were on Lucene 2.4. I'm curious whether Lucene has any out-of-the-box tools that accept a DateFormat and return a TokenStream with any identified dates. Looking through the javadocs for Lucene 2.9, I found the class:
org.apache.lucene.analysis.sinks.DateRecognizerSinkFilter
which seems to do what I need, but it implements a SinkFilter, a concept which doesn't seem to be documented in the Lucene Wiki. Has anyone used this filter before, and if so, what is the most effective way to use it?
There is a bit of sample code (which is, admittedly, over-complicated) in the documentation for TeeSinkTokenFilter. Note that the way the DateRecognizerSinkFilter is designed, it does not store the actual date; it only detects that a token is a date that conforms to the specified format.
What I would try is to re-implement the DateRecognizerSinkFilter class to take an array of DateFormat instances, create a new Attribute class called DateAttribute (or something similar), and have the date recognizer set the parsed date into the DateAttribute if one of its formats matches. That way, you can always test whether you have a valid date by interrogating the DateAttribute, and the date formats stay localized to one class. Another advantage is that you won't have to handle multiple sinks, which simplifies the code from the TeeSinkTokenFilter example.
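As a rough illustration of the "array of DateFormat instances" idea, here is a minimal sketch of just the recognition step, written against plain java.text so it does not depend on any particular Lucene version. The class name MultiFormatDateRecognizer and its methods are hypothetical, not part of Lucene; wiring the result into a token filter and a custom DateAttribute is left out.

```java
import java.text.DateFormat;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: tries several DateFormats in order
// and returns the parsed date instead of only reporting that a match occurred.
class MultiFormatDateRecognizer {
    private final DateFormat[] formats;

    MultiFormatDateRecognizer(DateFormat... formats) {
        this.formats = formats;
    }

    // Returns the parsed Date if some format consumes the whole term, else null.
    Date recognize(String term) {
        if (term == null) {
            return null;
        }
        for (DateFormat format : formats) {
            ParsePosition pos = new ParsePosition(0);
            Date parsed = format.parse(term, pos);
            // Require the full term to be consumed so "5/1/2011x" is rejected.
            if (parsed != null && pos.getIndex() == term.length()) {
                return parsed;
            }
        }
        return null;
    }

    // Recognizer for the two formats in the question: mm/dd/yy and mm/dd/yyyy.
    // The yy pattern is listed first: SimpleDateFormat reads a year of exactly
    // two digits as abbreviated under yy, but literally (year 11) under yyyy.
    static MultiFormatDateRecognizer forQuestionFormats() {
        SimpleDateFormat shortYear = new SimpleDateFormat("MM/dd/yy", Locale.US);
        SimpleDateFormat longYear = new SimpleDateFormat("MM/dd/yyyy", Locale.US);
        // Non-lenient parsing rejects out-of-range values such as month 13.
        shortYear.setLenient(false);
        longYear.setLenient(false);
        return new MultiFormatDateRecognizer(shortYear, longYear);
    }

    public static void main(String[] args) {
        MultiFormatDateRecognizer recognizer = forQuestionFormats();
        for (String term : new String[] {"5/1/2011", "5/1/11", "hello"}) {
            System.out.println(term + " -> " + recognizer.recognize(term));
        }
    }
}
```

A filter built around this would call recognize on each term and, on a non-null result, store the Date in the custom attribute. SimpleDateFormat is not thread-safe, so each analyzer thread would need its own recognizer instance.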