
Checking if element in list by substring

Tags: python, list

I have a list of urls (unicode), and there is a lot of repetition. For example, urls http://www.myurlnumber1.com and http://www.myurlnumber1.com/foo+%bar%baz%qux lead to the same place.

So I need to weed out all of those duplicates.

My first idea was to check if the element's substring is in the list, like so:

for url in list:
    if url[:30] not in list:
        print(url)

However, this tries to match the literal url[:30] against each list element and returns all of them, since no element is exactly equal to url[:30].

Is there an easy way to solve this problem?

EDIT:

Often the host and path in the urls stay the same, but the parameters are different. For my purposes, two urls with the same hostname and path but different parameters are still the same url, so one of them is a duplicate.

Zlo asked Sep 27 '16 13:09




1 Answer

If you consider all URLs with the same netloc to be the same, you can parse them with urllib.parse:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"

print(urlparse(u).netloc)

Which would give you:

www.myurlnumber1.com

So to get unique netlocs you could do something like:

unique = {urlparse(u).netloc for u in urls}

If you wanted to keep the url scheme:

urls  = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]

unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)

This presumes that every URL has a scheme, and that you don't have both http and https entries for the same netloc that you want treated as the same URL.

If you also want to add the path:

unique = {(u.netloc, u.path) for u in map(urlparse, urls)}
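The set above only holds the (netloc, path) pairs. If you would rather keep one full URL per pair, in the original order, something along these lines might work. This is only a sketch: the function name dedupe_by_netloc_and_path and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse


def dedupe_by_netloc_and_path(urls):
    """Return the first URL seen for each (netloc, path) pair, in input order."""
    seen = set()   # (netloc, path) keys already encountered
    kept = []      # first full URL for each new key
    for url in urls:
        parsed = urlparse(url)
        key = (parsed.netloc, parsed.path)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


# Hypothetical sample data: the second URL differs only in its query string.
urls = [
    "http://www.url.com/foo-bar",
    "http://www.url.com/foo-bar?t=baz",
    "http://www.url.com/baz-qux",
]
print(dedupe_by_netloc_and_path(urls))
# ['http://www.url.com/foo-bar', 'http://www.url.com/baz-qux']
```

Note that a URL with no path and one with a path have different keys here, so you would compare on netloc alone if those should count as duplicates too.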

The table of attributes is listed in the docs:

Attribute  Index  Value                               Value if not present
scheme     0      URL scheme specifier                scheme parameter
netloc     1      Network location part               empty string
path       2      Hierarchical path                   empty string
params     3      Parameters for last path element    empty string
query      4      Query component                     empty string
fragment   5      Fragment identifier                 empty string
username          User name                           None
password          Password                            None
hostname          Host name (lower case)              None
port              Port number as integer, if present  None
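To make the table concrete, here is a small sketch that reads each attribute by name, and the first six by index as well. The URL is an invented example that happens to have every component filled in.

```python
from urllib.parse import urlparse

# Invented URL with every component present, for illustration only.
parsed = urlparse("http://user:secret@WWW.Example.com:8080/a/b;p=1?q=2#frag")

# The first six attributes are also reachable by tuple index.
print(parsed.scheme, parsed[0])    # http http
print(parsed.netloc, parsed[1])    # user:secret@WWW.Example.com:8080 (twice)
print(parsed.path, parsed[2])      # /a/b /a/b
print(parsed.params, parsed[3])    # p=1 p=1
print(parsed.query, parsed[4])     # q=2 q=2
print(parsed.fragment, parsed[5])  # frag frag

# The remaining attributes are derived from netloc and have no index.
print(parsed.username)  # user
print(parsed.password)  # secret
print(parsed.hostname)  # www.example.com (lower-cased)
print(parsed.port)      # 8080
```

Since hostname is lower-cased and drops the credentials and port, it may be a better key than netloc if your URLs vary in those details.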

You just need to use whatever you consider to be the unique parts.

In [1]: from urllib.parse import  urlparse

In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux",  "www.url.com/foo-bar?t=baz"]


In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}

In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
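One thing worth knowing about the session above: for the entries without a scheme, urlparse leaves netloc empty and puts the whole string before the query into path, which is why joining the two still gives matching keys. If you want real (netloc, path) pairs for such inputs, one possible workaround is to prepend "//" first. The helper name netloc_and_path is hypothetical, and the "://" check is a simplifying assumption about the input.

```python
from urllib.parse import urlparse

# Without a scheme, the host ends up in path and netloc is empty.
bare = urlparse("www.url.com/baz-qux")
print(repr(bare.netloc), repr(bare.path))  # '' 'www.url.com/baz-qux'


def netloc_and_path(url):
    """Return (netloc, path), treating scheme-less input as host-first.

    Assumes any URL lacking '://' starts with the host name.
    """
    if "://" not in url:
        url = "//" + url  # a leading '//' makes urlparse read a netloc
    parsed = urlparse(url)
    return parsed.netloc, parsed.path


print(netloc_and_path("www.url.com/foo-bar?t=baz"))   # ('www.url.com', '/foo-bar')
print(netloc_and_path("http://www.url.com/foo-bar"))  # ('www.url.com', '/foo-bar')
```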
Padraic Cunningham answered Sep 30 '22 08:09