I have a list of company names, and I have a list of URLs mentioning company names.
The end goal is to look into each URL and find out how many of the companies on the page are in my list.
Example URL: http://www.dmx.com/about/our-clients
Each URL will be structured differently, so I don't have a good way to do a regex search and create individual strings for each company name.
I'd like to build a for loop to search for each company from the list in the entire contents of the URL. But it seems like Levenshtein is better suited to two short strings than to a short string and a large body of text.
Where should this beginner be looking?
It doesn't sound to me like you need any "fuzzy" matching. And I'm assuming that when you say "url" you mean "webpage at the address pointed to by the url." Just use Python's built-in substring search functionality:
>>> import urllib2
>>> webpage = urllib2.urlopen('http://www.dmx.com/about/our-clients')
>>> webpage_text = webpage.read()
>>> webpage.close()
>>> for name in ['Caribou Coffee', 'Express', 'Sears']:
...     if name in webpage_text:
...         print name, "found!"
...
Caribou Coffee found!
Express found!
>>>
If you are worried about capitalization mismatches, convert both the page text and the names to uppercase:
>>> webpage_text = webpage_text.upper()
>>> for name in ['CARIBOU COFFEE', 'EXPRESS', 'SEARS']:
...     if name in webpage_text:
...         print name, 'found!'
...
CARIBOU COFFEE found!
EXPRESS found!
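One caveat with a plain substring test: a short name such as "Express" also matches inside a longer word like "Expression". The following is a minimal sketch, assuming whole-word, case-insensitive matches are what you want; `find_listed_companies` and the sample text are made up for illustration. It returns the matching names, so the count asked for in the question is just the length of the result.

```python
import re

def find_listed_companies(names, text):
    """Return the names that occur in text as whole words, ignoring case."""
    found = []
    for name in names:
        # re.escape keeps characters such as '.' or '&' in a name literal;
        # \b anchors the match at word boundaries.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(name)
    return found

# Made-up sample text standing in for the downloaded page.
sample_text = "Our clients include Caribou Coffee and Expression Labs."
matches = find_listed_companies(["Caribou Coffee", "Express", "Sears"], sample_text)
print(len(matches))
```

Note that `\b` only works as intended when the name starts and ends with a letter or digit; a name ending in punctuation would need different handling.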
I would add to senderle's answer that it may make sense to normalize your names somehow (e.g., remove all special characters), and then apply the same normalization to both webpage_text and your list of strings.
def normalize_str(some_str):
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c,"")
    return some_str
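Put together, normalizing both sides before the substring test might look like the sketch below. The page text and the client names are made-up sample data, not taken from the page in the question.

```python
def normalize_str(some_str):
    """Lowercase the string and drop common punctuation."""
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c, "")
    return some_str

# Made-up sample data, standing in for the downloaded page and the name list.
webpage_text = "Clients: Macy's, Toys-R-Us, and AT&T."
client_names = ["Macys", "Toys R Us", "AT&T"]

# Normalize the page once, then normalize each name before testing it.
normalized_text = normalize_str(webpage_text)
found = [name for name in client_names if normalize_str(name) in normalized_text]
print(found)
```

Here "Toys R Us" is not found, because the hyphens in the page text are deleted rather than turned into spaces. Whether to delete a character or replace it with a space is a choice you may need to tune for your data.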
If this isn't good enough you can go to difflib and do something like:
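import difflib
# get_close_matches expects a list of candidate strings, not one big string,
# so split the page into lines first (a simple assumption; adjust to your pages)
candidates = normalize_str(webpage_text).splitlines()
for client_name in normalized_client_names:
    closest_client = difflib.get_close_matches(client_name, candidates, 1, 0.8)
    if len(closest_client) > 0:
        print client_name, "found as", closest_client[0]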
The cutoff I chose (a Ratcliff/Obershelp ratio of 0.8) is arbitrary and may be too lenient or too strict; play with it a bit.
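On the question's concern about comparing a short string with a large body of text: `difflib.get_close_matches` compares a word against a list of candidate strings, so the page has to be cut into name-sized pieces first. The following is a minimal sketch, assuming that a window of consecutive words as long as the name is a reasonable candidate; `best_window_match` and the sample page are hypothetical.

```python
import difflib

def best_window_match(name, text, cutoff=0.8):
    """Return (ratio, window) for the best fuzzy match of name in text, or None."""
    words = text.split()
    size = len(name.split())
    best = None
    # Slide a window of `size` words across the text and score each one.
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        ratio = difflib.SequenceMatcher(None, name.lower(), window.lower()).ratio()
        if ratio >= cutoff and (best is None or ratio > best[0]):
            best = (ratio, window)
    return best

# Made-up page text with a misspelled client name.
page = "Trusted by Cariboo Coffee, Express and many more"
result = best_window_match("Caribou Coffee", page)
print(result)
```

This scores one window per word position for every name, so it is slow on long pages with long lists; running the exact substring test first and falling back to this only for names that were not found keeps the cost down.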