 

Fuzzy matching a string within a large body of text in Python (url)

I have a list of company names, and I have a list of URLs whose pages mention company names.

The end goal is to look at the page behind each URL and find out how many of the companies mentioned there are in my list.

Example URL: http://www.dmx.com/about/our-clients

Each page is structured differently, so I don't have a good way to write a regex that extracts individual strings for each company name.

I'd like to build a for loop that searches the entire contents of the page for each company in the list. But it seems like Levenshtein distance works better for two short strings than for a short string and a large body of text.

Where should this beginner be looking?

asked Feb 24 '23 by Kyle

2 Answers

It doesn't sound to me like you need any "fuzzy" matching. And I'm assuming that when you say "url" you mean "webpage at the address pointed to by the url." Just use Python's built-in substring search functionality:

>>> import urllib2
>>> webpage = urllib2.urlopen('http://www.dmx.com/about/our-clients')
>>> webpage_text = webpage.read()
>>> webpage.close()
>>> for name in ['Caribou Coffee', 'Express', 'Sears']:
...     if name in webpage_text:
...         print name, "found!"
... 
Caribou Coffee found!
Express found!
>>> 

If you are worried about capitalization mismatches, just convert both the page text and your names to uppercase:

>>> webpage_text = webpage_text.upper()
>>> for name in ['CARIBOU COFFEE', 'EXPRESS', 'SEARS']:
...     if name in webpage_text:
...         print name, 'found!'
... 
CARIBOU COFFEE found!
EXPRESS found!
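
These snippets are Python 2 (urllib2 and the print statement). On Python 3, a roughly equivalent sketch, assuming the same URL and example names from the question, would use urllib.request and decode the response bytes before searching:

from urllib.request import urlopen

# Fetch the page and decode the response bytes into text.
with urlopen('http://www.dmx.com/about/our-clients') as response:
    webpage_text = response.read().decode('utf-8', errors='replace')

# Case-insensitive substring search: compare everything in uppercase.
webpage_text = webpage_text.upper()
for name in ['Caribou Coffee', 'Express', 'Sears']:
    if name.upper() in webpage_text:
        print(name, 'found!')
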
answered Feb 26 '23 by senderle

I would add to senderle's answer that it may make sense to normalize your names somehow (e.g., remove all special characters) and then apply the same normalization to webpage_text and to your list of names:

def normalize_str(some_str):
    # Lowercase and strip common punctuation characters.
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c, "")
    return some_str
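
A minimal sketch of how that might fit together (same Python 2 style as the snippets above; webpage_text and the name list are borrowed from the first answer):

normalized_text = normalize_str(webpage_text)
normalized_client_names = [normalize_str(name) for name in ['Caribou Coffee', 'Express', 'Sears']]

for client_name in normalized_client_names:
    if client_name in normalized_text:
        print client_name, "found!"
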

If this isn't good enough, you can turn to difflib and do something like:

import difflib

# get_close_matches expects a sequence of candidate strings, so split the page text into words.
webpage_words = normalize_str(webpage_text).split()
for client_name in normalized_client_names:
    closest_client = difflib.get_close_matches(client_name, webpage_words, 1, 0.8)
    if len(closest_client) > 0:
        print client_name, "found as", closest_client[0]

The cutoff I chose, a Ratcliff/Obershelp similarity ratio of 0.8, is arbitrary and may be too lenient or too strict; play with it a bit.
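
If you want to see what ratios you are actually getting before settling on a cutoff, difflib.SequenceMatcher computes the same ratio that get_close_matches uses under the hood (a small sketch; the candidate strings here are made-up examples):

import difflib

# Print the similarity ratio for a few sample pairs to help pick a cutoff.
for candidate in ["caribou coffee", "caribou cafe", "carbou coffee co"]:
    ratio = difflib.SequenceMatcher(None, "caribou coffee", candidate).ratio()
    print candidate, ratio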

answered Feb 26 '23 by dr jimbob