I'm looking for a Java/Scala library that can take a user query and a text and return whether the text matches the query.
I'm processing a stream of information (e.g. the Twitter stream) and can't afford a batch process. I need to evaluate each tweet in real time, rather than indexing it in a Lucene RAMDirectory and querying it later.
It's possible to create a parser/lexer using ANTLR, but this is such a common use case that I can't believe nobody has written a library for it before.
Some samples from the TextQuery Ruby library, which does exactly what I need:
TextQuery.new("'to be' OR NOT 'to_be'").match?("to be") # => true
TextQuery.new("-test").match?("some string of text") # => true
TextQuery.new("NOT test").match?("some string of text") # => true
TextQuery.new("a AND b").match?("b a") # => true
TextQuery.new("a AND b").match?("a c") # => false
q = TextQuery.new("a AND (b AND NOT (c OR d))")
q.match?("d a b") # => false
q.match?("b") # => false
q.match?("a b cdefg") # => true
TextQuery.new("a~").match?("adf") # => true
TextQuery.new("~a").match?("dfa") # => true
TextQuery.new("~a~").match?("daf") # => true
TextQuery.new("2~a~1").match?("edaf") # => true
TextQuery.new("2~a~2").match?("edaf") # => false
TextQuery.new("a", :ignorecase => true).match?("A b cD") # => true
Since it is implemented in Ruby, it's not suitable for my platform, and I can't bring in JRuby just for this one part of our solution.
I found a similar question but couldn't get an answer from it: Boolean Query / Expression to a Concrete syntax tree
Thanks!
Given that you are doing text search, I would try to leverage some of the infrastructure provided by Lucene. Maybe you could create a QueryParser and call parse to get back a Query. Instantiable subclasses of Query are:
TermQuery
MultiTermQuery
BooleanQuery
WildcardQuery
PhraseQuery
PrefixQuery
MultiPhraseQuery
FuzzyQuery
TermRangeQuery
NumericRangeQuery
SpanQuery
Then you may be able to use pattern matching to implement what a match means for your application:
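import org.apache.lucene.search.{BooleanQuery, Query, TermQuery}
import scala.collection.JavaConverters._

def match_?(tweet: String, query: Query): Boolean = query match {
  case q: TermQuery => tweet.contains(q.getTerm.text)
  case q: BooleanQuery =>
    // split the clauses into MUST_NOT, MUST and SHOULD groups
    val (prohibited, rest) = q.clauses.asScala.partition(_.isProhibited)
    val (required, optional) = rest.partition(_.isRequired)
    // no MUST_NOT clause may match and every MUST clause has to match;
    // with no MUST clauses, at least one SHOULD clause (if any) has to match
    prohibited.forall(c => !match_?(tweet, c.getQuery)) &&
      required.forall(c => match_?(tweet, c.getQuery)) &&
      (required.nonEmpty || optional.isEmpty || optional.exists(c => match_?(tweet, c.getQuery)))
  // you still need to cover the other subclasses listed above
  case _ => false
}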
val q = queryParser.parse(userQuery)
val res = match_?(tweet, q)
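To make the recursive idea concrete without depending on Lucene, here is a minimal sketch in plain Java. Everything in it (QuerySketch, Node, term, and, or, not, tokenize) is a hypothetical name made up for illustration, not part of Lucene or TextQuery. It also assumes whole-word matching on whitespace-separated tokens, which is what the TextQuery samples in the question seem to expect, rather than the substring check used in the TermQuery case above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class QuerySketch {
    /** A query node: decides whether a set of tokens satisfies it. */
    public interface Node {
        boolean matches(Set<String> tokens);
    }

    // Leaf: true when the exact word is among the tokens.
    public static Node term(String word) {
        return tokens -> tokens.contains(word);
    }

    // Boolean combinators: each one evaluates its children recursively.
    public static Node and(Node left, Node right) {
        return tokens -> left.matches(tokens) && right.matches(tokens);
    }

    public static Node or(Node left, Node right) {
        return tokens -> left.matches(tokens) || right.matches(tokens);
    }

    public static Node not(Node inner) {
        return tokens -> !inner.matches(tokens);
    }

    // Assumption: plain whitespace splitting; a real tokenizer would also
    // deal with punctuation and case.
    public static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    public static boolean matches(Node query, String text) {
        return query.matches(tokenize(text));
    }

    public static void main(String[] args) {
        // Hand-built tree for: a AND (b AND NOT (c OR d))
        Node q = and(term("a"), and(term("b"), not(or(term("c"), term("d")))));
        System.out.println(matches(q, "d a b"));     // false
        System.out.println(matches(q, "b"));         // false
        System.out.println(matches(q, "a b cdefg")); // true
    }
}
```

The tree is built by hand here. In practice you would produce it from the parsed query, for example by walking the Query object returned by the QueryParser and mapping each clause to one of these nodes.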
Here is an implementation. It surely has bugs, but you'll get the idea, and it shows a working proof of concept. It reuses the syntax, documentation and grammar of the default Lucene QueryParser.