I've been using difflib's SequenceMatcher
,
And I found the ratio
function to be too slow.
Reading through the documentation, I discovered quick_ratio
and real_quick_ratio
that are supposed to be quicker (as the name suggests) and serve as an upper bound.
However, the documentation lacks the description on the assumption they make, or on the speedup they offer.
When should I use either version, and what do I sacrifice ?
The difflib module provides classes and functions for comparing sequences. It can be used to compare files and can produce information about file differences in various formats. This class can be used to compare two input sequences or strings.
This module in the python standard library provides classes and functions for comparing sequences like strings, lists etc.
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified diffs.
SequenceMatcher FlowChart It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences. get_matching_blocks( ) return list of triples describing matching subsequences.
Starting off with the helper method _calculate_ratio
def _calculate_ratio(matches, length):
if length:
return 2.0 * matches / length
return 1.0
ratio
finds matches, and divides that by the total length of both strings times 2:
return _calculate_ratio(matches, len(self.a) + len(self.b))
This is actually what the source commentary says:
# viewing a and b as multisets, set matches to the cardinality
# of their intersection; this counts the number of matches
# without regard to order, so is clearly an upper bound
and then
return _calculate_ratio(matches, len(self.a) + len(self.b))
real_quick_ratio
finds the shortest string, divided by the total length of the strings times 2:
la, lb = len(self.a), len(self.b)
# can't have more matches than the number of elements in the
# shorter sequence
return _calculate_ratio(min(la, lb), la + lb)
this is the real upper bound.
real_quick_ratio
does nothing to look at the strings to see if there are any matches, it only computes an upper bound based on string length.
Now, I'm not an algorithm guy, but if you think ratio
is too slow to get the job done, I recommend using quick_ratio
, since it treats the problem adequately.
From the docstring
.ratio() is expensive to compute if you haven't already computed
.get_matching_blocks() or .get_opcodes(), in which case you may
want to try .quick_ratio() or .real_quick_ratio() first to get an
upper bound.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With