Canonical URL compare in Python?

Tags:

fuzzy-comparison

Are there any tools to do a URL compare in Python?

For example, if I have http://google.com and google.com/, I'd like to know that they are likely to be the same site.

If I were to construct a rule manually, I might uppercase it, strip off the http:// portion, and drop anything after the last alphanumeric character. But I can see this failing in some cases, as I'm sure you can as well.

Is there a library that does this? How would you do it?

707

asked Jul 19 '10 21:07

Colin Davis

2 Answers

This off the top of my head:

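def canonical_url(u):
    # Compare case-insensitively.
    u = u.lower()
    # Drop the scheme, if there is one.
    for scheme in ("http://", "https://"):
        if u.startswith(scheme):
            u = u[len(scheme):]
            break
    # Treat "www.example.com" and "example.com" as the same host.
    if u.startswith("www."):
        u = u[4:]
    # Ignore a single trailing slash.
    if u.endswith("/"):
        u = u[:-1]
    return u

def same_urls(u1, u2):
    return canonical_url(u1) == canonical_url(u2)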

Obviously, there's lots of room for more fiddling with this. Regexes might be better than startswith and endswith, but you get the idea.
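As one way of taking that fiddling a step further, here is a rough sketch that lets the standard library's urllib.parse split the URL instead of slicing strings by hand. The rules it applies are assumptions for illustration, not an established standard: the scheme and port are ignored, a leading "www." is dropped, trailing slashes are stripped from the path, and the path keeps its case because paths can be case-sensitive.

```python
from urllib.parse import urlsplit

def canonical_url(u):
    """Reduce a URL to a rough comparison key (illustrative rules only)."""
    u = u.strip()
    # urlsplit only recognises the host when a scheme or '//' is present,
    # so prefix bare inputs such as 'google.com/' with '//'.
    if "://" not in u and not u.startswith("//"):
        u = "//" + u
    parts = urlsplit(u)
    # .hostname is already lower-cased and excludes any port or credentials.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Ignore trailing slashes, but keep the case of the path.
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key

def same_urls(u1, u2):
    return canonical_url(u1) == canonical_url(u2)
```

With these rules, same_urls("http://google.com", "google.com/") is true, while two URLs on the same host with different paths still compare as different.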

123

answered Sep 22 '22 23:09

Ned Batchelder

You could look up the names using DNS and see if they point to the same IP address. Some minor string processing may be required to strip the scheme and path first.

from socket import gethostbyname_ex

urls = ['http://google.com', 'google.com/', 'www.google.com/', 'news.google.com']

data = []
for original_name in urls:
    print('url:', original_name)
    # Strip the scheme and anything after the host name.
    name = original_name.strip()
    name = name.replace('http://', '')
    name = name.replace('http:', '')
    if name.find('/') > 0:
        name = name[:name.find('/')]
    if name.find('\\') > 0:
        name = name[:name.find('\\')]
    print('dns lookup:', name)
    if name:
        try:
            # Returns (hostname, aliases, ip_addresses).
            result = gethostbyname_ex(name)
        except OSError:
            continue  # Unable to resolve
        for ip in result[2]:
            print('ip:', ip)
            data.append((ip, original_name))

print(data)

result:

url: http://google.com
dns lookup: google.com
ip: 66.102.11.104
url: google.com/
dns lookup: google.com
ip: 66.102.11.104
url: www.google.com/
dns lookup: www.google.com
ip: 66.102.11.104
url: news.google.com
dns lookup: news.google.com
ip: 66.102.11.104
[('66.102.11.104', 'http://google.com'), ('66.102.11.104', 'google.com/'), ('66.102.11.104', 'www.google.com/'), ('66.102.11.104', 'news.google.com')]
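To turn a list like data into an answer, the (ip, url) pairs can be grouped by address. This is only a sketch: group_by_ip is a hypothetical helper name, and the sample list is copied from the output above so that no network access is needed.

```python
from collections import defaultdict

def group_by_ip(pairs):
    """Map each IP address to the sorted list of URLs that resolved to it."""
    groups = defaultdict(set)
    for ip, url in pairs:
        groups[ip].add(url)
    return {ip: sorted(urls) for ip, urls in groups.items()}

# Sample pairs taken from the output shown above.
sample = [
    ('66.102.11.104', 'http://google.com'),
    ('66.102.11.104', 'google.com/'),
    ('66.102.11.104', 'www.google.com/'),
    ('66.102.11.104', 'news.google.com'),
]

groups = group_by_ip(sample)
for ip, names in groups.items():
    print(ip, names)
```

Note that news.google.com lands in the same group as google.com here, so a shared address is a hint that two URLs are related rather than proof that they are the same site.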

answered Sep 21 '22 23:09

Martlark
