
Multiple keyword search in Java

I have a Java-based application and a set of keywords in a MySQL database (about 3 million keywords in total; each may consist of more than one word, e.g. “memory”, “old house”, “European Union law”).

The user interacts with the application by uploading a document with arbitrary text (usually several pages long). I want to find out whether, and where, any of the 3 million keywords appear in the document.

I have tried looping over the keywords and searching the document for each one, but this is not efficient at all. Is there a library that can perform this search in a more time-efficient manner?
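
For reference, the loop approach described above might look roughly like the sketch below. The class and method names are hypothetical, and it assumes the keywords have already been loaded from MySQL into a list. It scans the whole document once per keyword, so with 3 million keywords it makes 3 million passes over the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NaiveKeywordScan {
    /** Maps each keyword found in the document to the character offsets where it starts. */
    public static Map<String, List<Integer>> scan(String document, List<String> keywords) {
        Map<String, List<Integer>> found = new LinkedHashMap<>();
        for (String keyword : keywords) {
            if (keyword.isEmpty()) {
                continue; // an empty keyword would match at every offset
            }
            // One full scan of the document per keyword: this is the slow part.
            int from = document.indexOf(keyword);
            while (from >= 0) {
                found.computeIfAbsent(keyword, k -> new ArrayList<>()).add(from);
                from = document.indexOf(keyword, from + 1);
            }
        }
        return found;
    }
}
```

Note that indexOf() matches raw substrings, so this sketch would also report “memory” inside “memoryless”.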

I would greatly appreciate any help.

Nikolaos Papadakis asked Feb 03 '15 06:02





3 Answers

The Apache Lucene project may be helpful.

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

You can find some useful tutorials here.

Taher Khorshidi answered Oct 05 '22 12:10



You could try using a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter). Build the filter from the keywords, then check each word or word sequence in the document against it to find candidate matches. Remember that a Bloom filter can return false positives, so confirm the candidates with a SQL query such as 'select keyword from keywordtable where keyword in (positives from bloom filter)' to identify which keywords are actually present in the uploaded document.

A Java implementation of a Bloom filter is available in the Guava library: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/BloomFilter.html
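
The pre-filtering step described above could be sketched as follows. To keep the example self-contained it uses a small hand-rolled Bloom filter built on java.util.BitSet instead of Guava's class. The class name KeywordPrefilter, its sizing parameters, the hash mixing and the tokenization rule are all assumptions made for illustration, and it assumes keywords are stored as words separated by single spaces.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Locale;

/** Illustrative stand-in for a library Bloom filter (hypothetical class, not Guava's API). */
public class KeywordPrefilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per keyword

    public KeywordPrefilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Double hashing: derive the i-th bit index from two base hashes.
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String keyword) {
        String k = keyword.toLowerCase(Locale.ROOT).trim();
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(k, i));
        }
    }

    /** False means definitely not a keyword; true means possibly a keyword. */
    public boolean mightContain(String s) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(s, i))) {
                return false;
            }
        }
        return true;
    }

    /** Returns every sequence of 1..maxWords words in the text that the filter does not rule out. */
    public List<String> candidates(String text, int maxWords) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 0; len < maxWords && start + len < words.size(); len++) {
                if (len > 0) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + len));
                String candidate = phrase.toString();
                if (mightContain(candidate)) {
                    hits.add(candidate); // may be a false positive; confirm against the database
                }
            }
        }
        return hits;
    }
}
```

The document is read once, and only the returned candidates need to be confirmed with the SQL query. With Guava, the hand-rolled filter would presumably be replaced by BloomFilter.create(...) together with its put() and mightContain() methods, sized for the 3 million keywords.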

Dev Blanked answered Oct 05 '22 10:10



You can also use the Lemur Project, which is available on SourceForge:

The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine and ClueWeb09 dataset.

As Taher recommended, Apache Lucene is also a nice tool. I've used both of them, and they're great.

cнŝdk answered Oct 05 '22 10:10
