Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently check if a string is an approximate substring of (approximately contrained in) another string, up to a given error threshold?

Take two character strings in C or C++, s1 and s2. It is rather trivial to check if one contains the other exactly.

The following will return true in case s2 is a substring of s1.

In C:

strstr(s1, s2)

In C++:

#include <string>
str.find(str2) != string::npos

With boost:

#include <boost/algorithm/string.hpp>
boost::algorithm::contains(s1, s2)

What I am looking for is an analogous way of efficiently (in terms of speed, not memory) finding whether a string is approximately contained in / is approximately a substring of another up to a given difference threshold. Pretty much like the agrep bash command does in Unix systems for finding files.

For example, suppose the function was called approx_contains. Then approx_contains(s1, s2, 4) could return true/false in case s2 is contained within s1 up to a Levenshtein distance of 4.

Searching online, I only found numerous references and questions on how to calculate Levenshtein distance between two strings, or theoretical algorithms for approximate string pattern matching that go beyond simply checking whether or not one string contains another - which here becomes wasteful.

Trying very hard to avoid reinventing the wheel, how could someone do such a checking in C or C++?

like image 385
太晚了 Avatar asked Oct 14 '22 22:10

太晚了


1 Answers

Another option is the library https://github.com/maxbachmann/rapidfuzz-cpp (I am the author) which includes a similarity algorithm called partial_ratio. It performs the following steps:

  1. searches for the longest common subsequences of the two strings
  2. Iterates over those subsequences and calculates a normalized Levenshtein distance between the shorter string and a substring of the longer string, with len(shorter string), that starts at the start of the subsequence
  3. returns score of the optimal alignment

As an example:

#include "rapidfuzz/fuzz.hpp"
#include <string>
std::string a = "txst";
std::string b = "this is a test";
double score = rapidfuzz::fuzz::partial_ratio(a, b);
// score is 75.0

In this example the similarity of the optimal alignment "txst" <-> "test" is calculated. Since only one substitution is required the similarity is 75%.

Getting edit distance instead of relative similarity

As you noted in the comments you are interested in the edit distance instead of relative similarity. Here is a reimplementation of partial_ratio, that does this:

template <typename Sentence1, typename Sentence2>
std::size_t partial_ratio(const Sentence1& s1, const Sentence2& s2, std::size_t max=-1)
{
  auto s1_view = rapidfuzz::common::to_string_view(s1);
  auto s2_view = rapidfuzz::common::to_string_view(s2);

  if (s1_view.empty() && s2_view.empty()) {
    return 0;
  }

  if (s1_view.empty() || s2_view.empty()) {
    return -1;
  }

  // this swap is performed in the original FuzzyWuzzy implementation, so
  // I use it in my implementation as well. Depending on your goals you
  // might want to remove this swap
  if (s1_view.length() > s2_view.length()) {
    return partial_ratio(s2_view, s1_view, max);
  }

  // Note this is a internal API and so it could change at any time
  // currently this is the slowest part, since my bitparallel
  // implementation of the LCS has a bug and so it is disabled for now
  auto blocks = rapidfuzz::detail::get_matching_blocks(s1_view, s2_view);

  // when there is a full match exit early
  for (const auto& block : blocks) {
    if (block.length == s1_view.length()) {
      return 0;
    }
  }

  // find the edit distance at the places with the longest common subsequence
  std::size_t min_dist = (std::size_t)-1;
  for (const auto& block : blocks) {
    std::size_t long_start = (block.dpos > block.spos) ? block.dpos - block.spos : 0;
    auto long_substr = s2_view.substr(long_start, s1_view.length());

    // Note you could use a different weight here like
     // e.g. {1, 1, 1} for the normal Levenshtein distance
    // only {1, 1, 1} and {1, 1, 2} have very fast implementations
    std::size_t dist = rapidfuzz::string_metric::levenshtein(
        s1_view, long_substr, {1, 1, 2}, max);

    if (dist < min_dist) {
      max = min_dist = dist;
    }
  }

  return min_dist;
}

s1and s2can be any type that can be converted to a rapidfuzz::basic_string_view. Some examples are:

  • std::basic_string (std::string, ...)
  • std::basic_string_view (std::string_string, ...) since C++17
  • c strings like char* (this has to find the length once when creating the string_view)
  • many other type, that have data() and .size() and use continous memory. Examples are boost::string_view or std::vector

For results with a edit distance above max (size_t)-1) i is returned instead.

like image 105
maxbachmann Avatar answered Oct 20 '22 15:10

maxbachmann