
How does SequenceMatcher.ratio work in difflib

I was trying out Python's difflib module and came across SequenceMatcher. I tried the following examples but couldn't understand the results.

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None,"abc","a").ratio()
0.5

>>> SequenceMatcher(None,"aabc","a").ratio()
0.4

>>> SequenceMatcher(None,"aabc","aa").ratio()
0.6666666666666666

Now, according to the documentation for ratio():

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.

So, for my cases, I calculated:

  1. T=4 and M=1 so ratio 2*1/4 = 0.5
  2. T=5 and M=2 so ratio 2*2/5 = 0.8
  3. T=6 and M=1 so ratio 2*1/6.0 = 0.33

According to my understanding, T = len("aabc") + len("a") and M = 2, because a occurs twice in aabc.

So, where am I going wrong? What am I missing?

Here is the source code of SequenceMatcher.ratio()

RanRag asked Sep 15 '12 10:09



1 Answer

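You've got the first case right. M counts matched pairs of elements, not occurrences, so each a in the second string can pair with at most one a in the first. In the second case, only one a from aabc matches, so M = 1 and the ratio is 2*1/5 = 0.4. In the third example, both a's match, so M = 2 and the ratio is 2*2/6 ≈ 0.67.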
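If you want to check this yourself, here is a minimal sketch that recomputes M and T for your three examples. It assumes M can be read off as the sum of the sizes of the blocks returned by get_matching_blocks(); match_count is just a helper name I made up for illustration, not part of difflib.

```python
from difflib import SequenceMatcher


def match_count(a, b):
    """Return M, taken here as the total size of all matching blocks."""
    sm = SequenceMatcher(None, a, b)
    # The final block is a zero-size sentinel, so it adds nothing to the sum.
    return sum(block.size for block in sm.get_matching_blocks())


for a, b in [("abc", "a"), ("aabc", "a"), ("aabc", "aa")]:
    m = match_count(a, b)
    t = len(a) + len(b)  # T: total number of elements in both sequences
    print(a, b, "M =", m, "T =", t, "2.0*M/T =", 2.0 * m / t,
          "ratio() =", SequenceMatcher(None, a, b).ratio())
```

For each pair, the hand-computed 2.0*M/T should agree with what ratio() returns.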

[P.S.: you're referring to the ancient Python 2.4 source code. The current source code is at hg.python.org.]

Fred Foo answered Oct 25 '22 01:10