 

Fuzzy matching a string within a large body of text in Python (url)

I have a list of company names, and I have a list of URLs whose pages mention company names.

The end goal is to look at the page behind each URL and find out how many of the companies mentioned there are in my list.

Example URL: http://www.dmx.com/about/our-clients

Each page is structured differently, so I don't have a good way to write a regex that extracts individual strings for each company name.

I'd like to build a for loop that searches the entire contents of the page for each company in the list. But it seems like Levenshtein distance works better for two short strings than for a short string and a large body of text.

Where should this beginner be looking?

asked Feb 24 '23 by Kyle

2 Answers

It doesn't sound to me like you need any "fuzzy" matching. And I'm assuming that when you say "url" you mean "webpage at the address pointed to by the url." Just use Python's built-in substring search functionality:

>>> import urllib2
>>> webpage = urllib2.urlopen('http://www.dmx.com/about/our-clients')
>>> webpage_text = webpage.read()
>>> webpage.close()
>>> for name in ['Caribou Coffee', 'Express', 'Sears']:
...     if name in webpage_text:
...         print name, "found!"
... 
Caribou Coffee found!
Express found!
>>> 

If you are worried about capitalization mismatches, just convert both the page text and your names to uppercase:

>>> webpage_text = webpage_text.upper()
>>> for name in ['CARIBOU COFFEE', 'EXPRESS', 'SEARS']:
...     if name in webpage_text:
...         print name, 'found!'
... 
CARIBOU COFFEE found!
EXPRESS found!
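
These snippets are Python 2 (urllib2 and the print statement). On Python 3, a roughly equivalent sketch, assuming the same URL and example names from the question, would use urllib.request and decode the response bytes before searching:

from urllib.request import urlopen

# Fetch the page and decode the response bytes into text.
with urlopen('http://www.dmx.com/about/our-clients') as response:
    webpage_text = response.read().decode('utf-8', errors='replace')

# Case-insensitive substring search: compare everything in uppercase.
webpage_text = webpage_text.upper()
for name in ['Caribou Coffee', 'Express', 'Sears']:
    if name.upper() in webpage_text:
        print(name, 'found!')
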
answered Feb 26 '23 by senderle

I would add to senderle's answer that it may make sense to normalize your names somehow (e.g., remove all special characters) and then apply the same normalization to webpage_text and to your list of names:

def normalize_str(some_str):
    # Lowercase and strip common punctuation characters.
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c, "")
    return some_str
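
A minimal sketch of how that might fit together (same Python 2 style as the snippets above; webpage_text and the name list are borrowed from the first answer):

normalized_text = normalize_str(webpage_text)
normalized_client_names = [normalize_str(name) for name in ['Caribou Coffee', 'Express', 'Sears']]

for client_name in normalized_client_names:
    if client_name in normalized_text:
        print client_name, "found!"
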

If this isn't good enough, you can turn to difflib and do something like:

import difflib

# get_close_matches expects a sequence of candidate strings, so split the page text into words.
webpage_words = normalize_str(webpage_text).split()
for client_name in normalized_client_names:
    closest_client = difflib.get_close_matches(client_name, webpage_words, 1, 0.8)
    if len(closest_client) > 0:
        print client_name, "found as", closest_client[0]

The cutoff I chose, a Ratcliff/Obershelp similarity ratio of 0.8, is arbitrary and may be too lenient or too strict; play with it a bit.
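
If you want to see what ratios you are actually getting before settling on a cutoff, difflib.SequenceMatcher computes the same ratio that get_close_matches uses under the hood (a small sketch; the candidate strings here are made-up examples):

import difflib

# Print the similarity ratio for a few sample pairs to help pick a cutoff.
for candidate in ["caribou coffee", "caribou cafe", "carbou coffee co"]:
    ratio = difflib.SequenceMatcher(None, "caribou coffee", candidate).ratio()
    print candidate, ratio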

answered Feb 26 '23 by dr jimbob