Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing a lot of text based on a constant set of search terms

Tags:

java

algorithm

I have a set of search terms like [+dog -"jack russels" +"fox terrier"], [+cat +persian -tabby]. These could be quite long with maybe 30 sub-terms making up each term.

I now have some online news articles extracts such as ["My fox terrier is the cutest dog in the world..."] and ["Has anyone seen my lost persian cat? He went missing ..."]. They're not too long, perhaps 500 characters at most each.

In traditional search engines one expects a huge amount of articles that are pre-processed into indexes, allowing for speed-ups when searching given 'search terms', using set theory/boolean logic to reduce articles to only ones that match the phrases. In this situation, however, the order of my search terms is ~10^5, and I'd like to be able to process a single article at a time, to see ALL the sets of search terms that article would be matched with (i.e. all the + terms are in the text and none of the - terms).

I have a possible solution using two maps (one for the positive sub-phrases, one for the negative sub-phrases), but I don't think it'll be very efficient.

First prize would be a library that solves this problem, second prize is a push in the right direction towards solving this.

Kind regards,

like image 793
Noxville Avatar asked Nov 13 '22 05:11

Noxville


1 Answers

Assuming all the positive sub-terms are required for a match:

Put all the sub-terms from your search terms into a hashtable. The sub-term is the key, the value is a pointer to the full search term data structure (which should include a unique id and a map of sub-terms to a boolean).

Additionally, when processing a news item, create a "candidates" map, indexed by the term id. Each candidate structure has a pointer to the term definition, a set that contains the seen sub-terms and a "rejected" flag.

Iterate over the words of the news article.

For each hit, look up the candidate entry. If not there, create and add an empty one.

If the candidate rejection flag is set, you are done.

Otherwise, look up the sub-term from the term data structure. If negative, set the rejected flag. If positive, add the sub-term to the set of seen sub-terms.

In the end, iterate over the candidates. All candidates that are not rejected and where the size of the seen set equals to the number of positive sub-terms of that term are your hits.

Implementation: https://docs.google.com/document/d/1boieLJboLTy7X2NH1Grybik4ERTpDtFVggjZeEDQH74/edit

Runtime is O(n * m) where n is the number of words in the article and m is the maximum number of terms sharing the same sub-term (expected to be relatively small).

like image 87
Stefan Haustein Avatar answered Nov 16 '22 04:11

Stefan Haustein