I have a Java-based application and a set of keywords in a MySQL database (about 3M keywords in total; each may consist of more than one word, e.g. “memory”, “old house”, “European Union law”, etc.).
The user interacts with the application by uploading a document with arbitrary text (usually several pages). I want to find out whether, and where, any of the 3 million keywords appear in the document.
I have tried looping over the keywords and searching the document for each one, but this is not efficient at all. Is there a library that performs this kind of search in a more time-efficient manner?
I would greatly appreciate any help.
You can use the contains(), indexOf() and lastIndexOf() methods to check whether one String contains another String in Java. A String that occurs inside another String is known as a substring. The indexOf() method accepts a String and returns the starting position of its first occurrence if it exists; otherwise it returns -1.
You can search for a particular word in a string using the indexOf() method of the String class. This method returns the index at which the word starts within the string if it is found; otherwise it returns -1.
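As a minimal sketch of the indexOf() approach for a single keyword, the loop below collects every starting position rather than only the first. The class name OccurrenceFinder and the method findAll are made up for illustration. Note that this is the per-keyword scan the question already describes as too slow for 3M keywords; it only shows how the positions are obtained.

```java
import java.util.ArrayList;
import java.util.List;

class OccurrenceFinder {

    /** Returns the start index of every occurrence of keyword in text (overlapping matches included). */
    static List<Integer> findAll(String text, String keyword) {
        List<Integer> positions = new ArrayList<>();
        if (keyword.isEmpty()) {
            return positions; // an empty keyword would match at every index
        }
        int from = 0;
        while (true) {
            int idx = text.indexOf(keyword, from); // -1 when there is no further match
            if (idx < 0) {
                break;
            }
            positions.add(idx);
            from = idx + 1; // continue searching just after the start of this match
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "The old house stood next to another old house.";
        System.out.println(findAll(text, "old house")); // prints [4, 36]
        System.out.println(text.contains("memory"));    // prints false
    }
}
```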
The Apache Lucene project may be helpful.
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
You can find some useful tutorials here.
You could try using a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter). Build the filter from the keywords, then check each word or sequence of words from the document against it to find the positives. Remember that a Bloom filter can report false positives (though never false negatives). So once you have the positives, run a SQL query like 'select keyword from keywordtable where keyword in (positives from bloom filter)' to confirm exactly which keywords are present in the uploaded document.
A Java implementation of a Bloom filter is available in the Guava library: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/BloomFilter.html
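A rough sketch of this idea follows, under several assumptions. To stay dependency-free it uses a tiny hand-rolled filter (SimpleBloom, a made-up name) instead of Guava's BloomFilter. Document phrases are generated as every run of 1 to maxWords consecutive lower-cased words. An in-memory Set stands in for the confirming SQL query, so in a real setup that lookup would go to the keyword table. The filter size and hash count are arbitrary illustration values, not tuned for 3M keywords.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class BloomKeywordScan {

    /** Tiny hand-rolled Bloom filter, standing in for a library class such as Guava's BloomFilter. */
    static final class SimpleBloom {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        SimpleBloom(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        // Double hashing: the i-th bit position is derived from two base hashes.
        private int index(String s, int i) {
            return Math.floorMod(s.hashCode() + i * fnv1a(s), size);
        }

        void put(String s) {
            for (int i = 0; i < hashCount; i++) bits.set(index(s, i));
        }

        boolean mightContain(String s) {
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(index(s, i))) return false; // definitely not a keyword
            }
            return true; // a keyword, or a false positive
        }
    }

    // FNV-1a over the UTF-8 bytes, used as the second base hash.
    static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    /** Every run of 1..maxWords consecutive words in the document, lower-cased. */
    static List<String> candidates(String document, int maxWords) {
        List<String> words = new ArrayList<>();
        for (String w : document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) words.add(w);
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 0; len < maxWords && start + len < words.size(); len++) {
                if (len > 0) phrase.append(' ');
                phrase.append(words.get(start + len));
                out.add(phrase.toString());
            }
        }
        return out;
    }

    /** Bloom filter as a cheap pre-check; exactKeywords stands in for the confirming SQL query. */
    static Set<String> scan(String document, SimpleBloom bloom, Set<String> exactKeywords, int maxWords) {
        Set<String> found = new TreeSet<>();
        for (String c : candidates(document, maxWords)) {
            if (bloom.mightContain(c) && exactKeywords.contains(c)) found.add(c);
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("memory", "old house", "european union law");
        SimpleBloom bloom = new SimpleBloom(1 << 16, 3); // sizes chosen arbitrarily for the demo
        keywords.forEach(bloom::put);
        String doc = "The European Union law on memory chips applies to every old house.";
        System.out.println(scan(doc, bloom, keywords, 3));
    }
}
```

The point of the design is that the document is scanned once and each candidate phrase costs a few bit lookups, instead of looping over all keywords. The price is that matching becomes word-based and case-insensitive here, which may or may not be what you need.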
You can also use The Lemur Project, which is available on SourceForge:
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine and ClueWeb09 dataset.
As Taher recommended, Apache Lucene is also a nice tool. I've used both of them and they're great.