Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fuzzy .substring text-matching function

I am looking for a way to fuzzy substring function. What do I mean by that:

  • Two strings are given.
  • One is often longer than the other one. Let's call then "short" and "long"
  • We want to score how much of the "short" has appeared in the "long".
  • We want to take into account proximity and oder. Like if the elements of the "short" appear in the "long", they are preferred to appear in the same order and close to each other.

Example 1:

  • Short: "weeds are destroyed"
  • Long: "Crops engineered with a bacterial gene making the plants resistant to herbicides can grow while weeds are destroyed, and genetically engineered crops that can resist destructive insects reduce the need for chemical insecticides."

This is an exact match and should have score 1.0.

Example 2:

  • Short: "weeds will be destroyed"
  • Long: Same as above.

This is a fuzzy match, as "weed" and "destroyed" appear in the text, but without "will be". Still it should get a high-score (say 0.8).

Example 3:

If we set the "Short" to "destroyed will be weeds", although "destroyed" and "weeds" both appear in the original text, the score should be very low, since their order has changed.

Any suggested implementation on this?

A final point is that, there is no unique way of doing this scoring. But I am looking for AN algorithm. The parameters of this algorithm can be tuned based on the needs and requirements.

like image 616
Daniel Avatar asked Feb 19 '17 18:02

Daniel


2 Answers

Here's one possible approach:

  1. for the first word short(0), store the first indexOf in long
  2. for each subsequent word short(n), store both of: a) the first indexOf in long, and b) (preferred) the first indexOf short(n) which occurs no later than the preferred indexOf short(n-1).
  3. score accordingly
like image 131
Patrick Parker Avatar answered Nov 19 '22 02:11

Patrick Parker


I'd split both strings in dependency tree (something like this). Then, recursively traversed smaller tree from root and checked whether token is present in bigger tree. If yes, then add score similarity_of_dependency_kind. Optionally, that can be multiplied by similarity_of_destination_words (in terms of synonymicity, something like wordnet).

This approach is less efficient, but more accurate.

Also, don't forget preliminary data cleaning, like typos correction.

like image 2
dveim Avatar answered Nov 19 '22 02:11

dveim