Take two character strings in C or C++, s1
and s2
. It is rather trivial to check if one contains the other exactly.
The following will return true
in case s2
is a substring of s1
.
In C:
strstr(s1, s2)
In C++:
#include <string>
str.find(str2) != string::npos
With boost:
#include <boost/algorithm/string.hpp>
boost::algorithm::contains(s1, s2)
What I am looking for is an analogous way of efficiently (in terms of speed, not memory) finding whether a string is approximately contained in / is approximately a substring of another up to a given difference threshold. Pretty much like the agrep
bash command does in Unix systems for finding files.
For example, suppose the function was called approx_contains
. Then approx_contains(s1, s2, 4)
could return true/false in case s2
is contained within s1
up to a Levenshtein distance of 4.
Searching online, I only found numerous references and questions on how to calculate Levenshtein distance between two strings, or theoretical algorithms for approximate string pattern matching that go beyond simply checking whether or not one string contains another - which here becomes wasteful.
Trying very hard to avoid reinventing the wheel, how could someone do such a checking in C or C++?
Another option is the library https://github.com/maxbachmann/rapidfuzz-cpp (I am the author) which includes a similarity algorithm called partial_ratio
.
It performs the following steps:
As an example:
#include "rapidfuzz/fuzz.hpp"
#include <string>
std::string a = "txst";
std::string b = "this is a test";
double score = rapidfuzz::fuzz::partial_ratio(a, b);
// score is 75.0
In this example the similarity of the optimal alignment "txst" <-> "test" is calculated. Since only one substitution is required the similarity is 75%.
As you noted in the comments you are interested in the edit distance instead of relative similarity. Here is a reimplementation of partial_ratio, that does this:
template <typename Sentence1, typename Sentence2>
std::size_t partial_ratio(const Sentence1& s1, const Sentence2& s2, std::size_t max=-1)
{
auto s1_view = rapidfuzz::common::to_string_view(s1);
auto s2_view = rapidfuzz::common::to_string_view(s2);
if (s1_view.empty() && s2_view.empty()) {
return 0;
}
if (s1_view.empty() || s2_view.empty()) {
return -1;
}
// this swap is performed in the original FuzzyWuzzy implementation, so
// I use it in my implementation as well. Depending on your goals you
// might want to remove this swap
if (s1_view.length() > s2_view.length()) {
return partial_ratio(s2_view, s1_view, max);
}
// Note this is a internal API and so it could change at any time
// currently this is the slowest part, since my bitparallel
// implementation of the LCS has a bug and so it is disabled for now
auto blocks = rapidfuzz::detail::get_matching_blocks(s1_view, s2_view);
// when there is a full match exit early
for (const auto& block : blocks) {
if (block.length == s1_view.length()) {
return 0;
}
}
// find the edit distance at the places with the longest common subsequence
std::size_t min_dist = (std::size_t)-1;
for (const auto& block : blocks) {
std::size_t long_start = (block.dpos > block.spos) ? block.dpos - block.spos : 0;
auto long_substr = s2_view.substr(long_start, s1_view.length());
// Note you could use a different weight here like
// e.g. {1, 1, 1} for the normal Levenshtein distance
// only {1, 1, 1} and {1, 1, 2} have very fast implementations
std::size_t dist = rapidfuzz::string_metric::levenshtein(
s1_view, long_substr, {1, 1, 2}, max);
if (dist < min_dist) {
max = min_dist = dist;
}
}
return min_dist;
}
s1
and s2
can be any type that can be converted to a rapidfuzz::basic_string_view. Some examples are:
data()
and .size()
and use continous memory. Examples are boost::string_view or std::vectorFor results with a edit distance above max (size_t)-1) i
is returned instead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With