Fuzzy String Comparison

Tags: python, nlp, fuzzy-comparison

I am trying to write a program that reads in a file and compares each sentence against an original sentence. A sentence that matches the original perfectly should receive a score of 1, and a sentence that is the total opposite should receive a 0. All other fuzzy sentences should receive a score between 0 and 1.

I am unsure which operation to use to do this in Python 3.

I have included sample text below, in which Text 1 is the original and the strings that follow are the comparisons.

Text: Sample

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines // Should score high, but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines // Should score lower than text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than text 21 but NOT 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score a 0!


asked Apr 30 '12 11:04

jacksonstephenc

1 Answer

There is a package called fuzzywuzzy. Install via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
96

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler to use, fuzzywuzzy offers several matching methods (such as token-order-insensitive matching and partial string matching) that make it more powerful in practice. The process.extract functions are especially useful: they find the best-matching strings, along with their ratios, from a set of choices. From their readme:

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
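To give a feel for what a partial ratio does, here is a rough sketch using only difflib from the standard library. It is a simplification and not fuzzywuzzy's actual implementation: it slides the shorter string along the longer one and keeps the best window score. The name `partial_ratio_sketch` is made up for this example, and it returns a float between 0.0 and 1.0 rather than an integer out of 100.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> float:
    """Best similarity (0.0 to 1.0) between the shorter string and any
    same-length window of the longer one.

    Hypothetical helper for illustration; not the fuzzywuzzy algorithm.
    """
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        # Two empty strings are identical; empty vs. non-empty shares nothing.
        return 1.0 if not longer else 0.0
    best = 0.0
    # Compare the shorter string with every window of equal length.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best


print(partial_ratio_sketch("this is a test", "this is a test!"))  # 1.0
```

The trailing "!" no longer costs anything, because one window of the longer string equals the shorter string exactly.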

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
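The token sort idea can be approximated with difflib alone: split each string into words, sort them, rejoin, and compare the results. This is only a sketch under that assumption. The helper names `ratio` and `token_sort_ratio_sketch` are invented here, scores are floats from 0.0 to 1.0, and the real package may normalize its input more than this does.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Plain character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_ratio_sketch(a: str, b: str) -> float:
    """Similarity after lowercasing and sorting the words of each string.

    Hypothetical helper for illustration; not the fuzzywuzzy implementation.
    """
    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))

    return ratio(normalize(a), normalize(b))


a = "fuzzy wuzzy was a bear"
b = "wuzzy fuzzy was a bear"
print(ratio(a, b) < 1.0)              # True: word order still matters here
print(token_sort_ratio_sketch(a, b))  # 1.0: same words, order ignored
```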

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
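A token set comparison can be sketched in a similar way: reduce each string to its set of words, then compare the shared words against each side's full word set and keep the best score. The code below is a simplified approximation built on difflib, not the package's own code, and `token_set_ratio_sketch` is a name made up for this example.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Plain character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def token_set_ratio_sketch(a: str, b: str) -> float:
    """Similarity based on unique words, ignoring order and repetition.

    Hypothetical helper for illustration; not the fuzzywuzzy implementation.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())

    # Words shared by both strings, and the words unique to each side.
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()

    # Keep the most favourable of the three pairwise comparisons.
    return max(
        ratio(common, combined_a),
        ratio(common, combined_b),
        ratio(combined_a, combined_b),
    )


print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 1.0
```

One consequence of this design is that a string whose words are a subset of the other's also scores 1.0, so it is a forgiving measure.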

Process

>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
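Applied to the sample texts from the question, a minimal sketch might look like the following. It uses difflib directly so that it runs without installing anything, and it yields scores between 0 and 1 as the question asks. The names `ORIGINAL`, `CANDIDATES` and `score` are assumptions made for this example. Swapping in `fuzz.ratio(...) / 100` should behave similarly, though I have not checked that the numbers are identical.

```python
from difflib import SequenceMatcher

ORIGINAL = (
    "It was a dark and stormy night. I was all alone sitting on a red chair. "
    "I was not completely alone as I had three cats."
)

CANDIDATES = {
    "Text 20": "It was a murky and stormy night. I was all alone sitting on a "
               "crimson chair. I was not completely alone as I had three felines",
    "Text 21": "It was a murky and tempestuous night. I was all alone sitting on a "
               "crimson cathedra. I was not completely alone as I had three felines",
    "Text 22": "I was all alone sitting on a crimson cathedra. I was not completely "
               "alone as I had three felines. It was a murky and tempestuous night.",
    "Text 24": "It was a dark and stormy night. I was not alone. I was not sitting "
               "on a red chair. I had three cats.",
}


def score(original: str, candidate: str) -> float:
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    matcher = SequenceMatcher(None, original.lower(), candidate.lower(), autojunk=False)
    return matcher.ratio()


scores = {label: score(ORIGINAL, text) for label, text in CANDIDATES.items()}

# Rank the candidates from most to least similar, like process.extract does.
for label, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {value:.2f}")
```

Keep in mind that these ratios measure surface overlap, not meaning. Text 24 shares most of its characters with the original, so it scores well above 0 even though it says the opposite.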


answered Oct 04 '22 02:10

congusbongus
