I have a list of URLs (unicode), and there is a lot of repetition.
For example, the URLs http://www.myurlnumber1.com and http://www.myurlnumber1.com/foo+%bar%baz%qux lead to the same place.
So I need to weed out all of those duplicates.
My first idea was to check if the element's substring is in the list, like so:
for url in list:
    if url[:30] not in list:
        print(url)
However, it tries to match the literal url[:30] against each list element and obviously returns all of them, since no element exactly matches url[:30].
Is there an easy way to solve this problem?
EDIT:
Often the host and path in the URLs stay the same, but the parameters differ. For my purposes, URLs with the same hostname and path but different parameters are the same URL and count as duplicates.
If you consider all URLs with the same netloc to be duplicates, you can parse them with urllib.parse:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"
print(urlparse(u).netloc)
Which would give you:
www.myurlnumber1.com
So to get unique netlocs you could do something like:
unique = {urlparse(u).netloc for u in urls}
If you wanted to keep the url scheme:
urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]
unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)
This presumes that every URL has a scheme, and that you don't have both http and https versions of the same netloc that you want to treat as identical.
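If the list does mix http and https (or differently cased host names) for the same site, one option is to key on the lower-cased hostname alone. This is only a minimal sketch, assuming the scheme and port can be ignored; the sample URLs are made up for illustration:

```python
from urllib.parse import urlparse

# Hypothetical sample data: the same site under two schemes and mixed case.
urls = [
    "http://www.myurlnumber1.com/foo",
    "https://WWW.myurlnumber1.com",
    "http://www.myurlnumber2.com",
]

# .hostname is lower-cased by urlparse and excludes any port or credentials.
unique_hosts = {urlparse(u).hostname for u in urls}
print(sorted(unique_hosts))
```

Note that .hostname is None for a URL without a scheme or leading "//", so this only helps when every entry is a full URL.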
If you also want to add the path:
unique = {(u.netloc, u.path) for u in map(urlparse, urls)}
The table of attributes is listed in the docs:
Attribute   Index   Value                                 Value if not present
scheme      0       URL scheme specifier                  scheme parameter
netloc      1       Network location part                 empty string
path        2       Hierarchical path                     empty string
params      3       Parameters for last path element      empty string
query       4       Query component                       empty string
fragment    5       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as integer, if present    None
You just need to use whatever you consider to be the unique parts.
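To see how those attributes map onto a concrete URL, here is a small sketch. The URL is a made-up example chosen so that most fields are populated:

```python
from urllib.parse import urlparse

# Hypothetical URL, not taken from the question.
parts = urlparse("http://www.url.com:8080/foo-bar;v=1?t=baz#top")

print(parts.scheme)    # http
print(parts.netloc)    # www.url.com:8080
print(parts.path)      # /foo-bar
print(parts.params)    # v=1
print(parts.query)     # t=baz
print(parts.fragment)  # top
print(parts.hostname)  # www.url.com
print(parts.port)      # 8080
```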
In [1]: from urllib.parse import urlparse
In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]
In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}
In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
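If you would rather keep the original URLs (and their order) than a set of joined strings, something along these lines might work. It is a sketch under a few assumptions: dedupe_urls is a hypothetical helper name, a URL without "//" is taken to have no scheme, host names are compared case-insensitively, and a trailing slash on the path is ignored:

```python
from urllib.parse import urlparse

def dedupe_urls(urls):
    """Return the first URL seen for each (host, path) pair, in input order."""
    seen = set()
    result = []
    for url in urls:
        # Assumption: no "//" means no scheme, so prepend "//" to make
        # urlparse treat the leading part as the network location.
        parsed = urlparse(url if "//" in url else "//" + url)
        # Assumption: host case and a trailing slash do not matter.
        key = (parsed.netloc.lower(), parsed.path.rstrip("/"))
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result

urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz",
        "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]
print(dedupe_urls(urls))
```

With the list above this keeps "http://www.url.com/foo-bar" and "www.url.com/baz-qux". Adjust the key tuple if you consider other parts (scheme, params) significant.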