Python difflib's ratio, quick_ratio and real_quick_ratio

Tags:

diff

I've been using difflib's SequenceMatcher, and I found the ratio function to be too slow. Reading through the documentation, I discovered quick_ratio and real_quick_ratio, which are supposed to be quicker (as the names suggest) and serve as upper bounds on ratio.

However, the documentation does not describe the assumptions they make or the speedup they offer.

When should I use each version, and what do I sacrifice?


asked May 23 '18 11:05

Uri Goren

1 Answer

Taking a look at the source

Starting off with the helper function _calculate_ratio:

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

ratio

ratio counts the matching elements, multiplies that count by 2, and divides by the total length of both strings:

    return _calculate_ratio(matches, len(self.a) + len(self.b))
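As a rough check of that formula, the following sketch recomputes ratio by hand. It assumes that matches is the summed size of the blocks returned by get_matching_blocks(), and the two strings are just illustrative inputs.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
sm = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks; the last block is a
# zero-size sentinel, so it does not affect the total.
matches = sum(block.size for block in sm.get_matching_blocks())

# Same arithmetic as _calculate_ratio: 2 * matches / total length.
manual = 2.0 * matches / (len(a) + len(b))

print(matches)     # 3 (the common run "bcd")
print(manual)      # 0.75
print(sm.ratio())  # 0.75
```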

quick_ratio

This is actually what the source commentary says:

    # viewing a and b as multisets, set matches to the cardinality
    # of their intersection; this counts the number of matches
    # without regard to order, so is clearly an upper bound

and then

    return _calculate_ratio(matches, len(self.a) + len(self.b))
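To make the multiset idea concrete, here is a small sketch that computes the same quantity with collections.Counter. This is not how difflib implements it internally, only an equivalent calculation on illustrative inputs, chosen so that ordering matters.

```python
from collections import Counter
from difflib import SequenceMatcher

a, b = "abcd", "dcba"  # same characters, reversed order
sm = SequenceMatcher(None, a, b)

# Multiset intersection: for each element, keep the smaller of its
# two counts, then add the counts up.
overlap = sum((Counter(a) & Counter(b)).values())
manual_quick = 2.0 * overlap / (len(a) + len(b))

print(overlap)           # 4
print(manual_quick)      # 1.0
print(sm.quick_ratio())  # 1.0
# ratio() respects ordering, so it comes out lower than quick_ratio() here.
print(sm.ratio() < sm.quick_ratio())  # True
```

Because quick_ratio ignores order, it can report a perfect score for strings that ratio considers quite different.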

real_quick_ratio

real_quick_ratio takes the length of the shorter string, multiplies it by 2, and divides by the total length of both strings:

    la, lb = len(self.a), len(self.b)
    # can't have more matches than the number of elements in the
    # shorter sequence
    return _calculate_ratio(min(la, lb), la + lb)

This is the loosest upper bound of the three: ratio() <= quick_ratio() <= real_quick_ratio().
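A quick illustration of how loose the length-only bound can be, using two illustrative strings of equal length that share no characters:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, "abcd", "wxyz")

# Only the lengths are used: 2 * min(4, 4) / (4 + 4).
print(sm.real_quick_ratio())  # 1.0
# No characters in common, so the multiset intersection is empty.
print(sm.quick_ratio())       # 0.0
# No matching blocks either.
print(sm.ratio())             # 0.0
```

So a high real_quick_ratio says nothing about similarity; only a low value is informative, because it rules a pair out.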

Conclusion

real_quick_ratio does not look at the contents of the strings at all; it only computes an upper bound from their lengths.

Now, I'm not an algorithm guy, but if you think ratio is too slow to get the job done, I recommend using quick_ratio, since it at least compares the contents of the two strings. Keep in mind that it is still only an upper bound, not the ratio itself.

Note on efficiency

From the docstring:

    .ratio() is expensive to compute if you haven't already computed
    .get_matching_blocks() or .get_opcodes(), in which case you may
    want to try .quick_ratio() or .real_quick_ratio() first to get an
    upper bound.
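One way to act on that advice is to use the cheap bounds as filters and fall through to ratio only when a pair survives both. The sketch below assumes a similarity threshold; is_similar and cutoff are hypothetical names of my own, not part of difflib, although as far as I can tell difflib.get_close_matches chains the three calls in the same way.

```python
from difflib import SequenceMatcher

def is_similar(a, b, cutoff=0.6):
    """Hypothetical helper: True if ratio() >= cutoff.

    Since both quick variants are upper bounds on ratio(), a pair
    that fails either of them cannot pass the full check, and the
    short-circuiting `and` skips the expensive call.
    """
    sm = SequenceMatcher(None, a, b)
    return (sm.real_quick_ratio() >= cutoff   # lengths only
            and sm.quick_ratio() >= cutoff    # element counts, order ignored
            and sm.ratio() >= cutoff)         # full matching

print(is_similar("abcd", "bcde"))        # True  (ratio is 0.75)
print(is_similar("abcd", "wxyz"))        # False (rejected by quick_ratio)
print(is_similar("a", "abcdefghij"))     # False (rejected by real_quick_ratio)
```

The result is always the same as checking ratio alone; the filters only save time on pairs that are clearly dissimilar.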


answered Oct 15 '22 14:10

Pax Vobiscum
