Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python difflib's ratio, quick_ratio and real_quick_ratio

Tags:

python

diff

I've been using difflib's SequenceMatcher,

And I found the ratio function to be too slow. Reading through the documentation, I discovered quick_ratio and real_quick_ratio that are supposed to be quicker (as the name suggests) and serve as an upper bound.

However, the documentation lacks the description on the assumption they make, or on the speedup they offer.

When should I use either version, and what do I sacrifice ?

like image 762
Uri Goren Avatar asked May 23 '18 11:05

Uri Goren


People also ask

How does Difflib work in python?

The difflib module provides classes and functions for comparing sequences. It can be used to compare files and can produce information about file differences in various formats. This class can be used to compare two input sequences or strings.

Is Difflib standard python?

This module in the python standard library provides classes and functions for comparing sequences like strings, lists etc.

What is Difflib module in python?

This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified diffs.

How does SequenceMatcher work in python?

SequenceMatcher FlowChart It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences. get_matching_blocks( ) return list of triples describing matching subsequences.


1 Answers

Taking a look

Starting off with the helper method _calculate_ratio

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

ratio

ratio finds matches, and divides that by the total length of both strings times 2:

    return _calculate_ratio(matches, len(self.a) + len(self.b))

quick_ratio

This is actually what the source commentary says:

    # viewing a and b as multisets, set matches to the cardinality
    # of their intersection; this counts the number of matches
    # without regard to order, so is clearly an upper bound

and then

    return _calculate_ratio(matches, len(self.a) + len(self.b))

real_quick_ratio

real_quick_ratio finds the shortest string, divided by the total length of the strings times 2:

    la, lb = len(self.a), len(self.b)
    # can't have more matches than the number of elements in the
    # shorter sequence
    return _calculate_ratio(min(la, lb), la + lb)

this is the real upper bound.

Conclusion

real_quick_ratio does nothing to look at the strings to see if there are any matches, it only computes an upper bound based on string length.

Now, I'm not an algorithm guy, but if you think ratio is too slow to get the job done, I recommend using quick_ratio, since it treats the problem adequately.

Note on efficiency

From the docstring

    .ratio() is expensive to compute if you haven't already computed
    .get_matching_blocks() or .get_opcodes(), in which case you may
    want to try .quick_ratio() or .real_quick_ratio() first to get an
    upper bound.
like image 158
Pax Vobiscum Avatar answered Oct 15 '22 14:10

Pax Vobiscum