I have a list of URLs like this:

    l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc', ......., 'bit.ly/1bdDlXc']

I just want to get the full URL from the short one for every element in that list.
Here is my approach:

    import urllib2

    for i in l:
        print urllib2.urlopen(i).url
But when the list contains thousands of URLs, the program takes a long time.

My question: Is there any way to reduce the execution time, or another approach I should follow?
First method
As suggested, one way to accomplish the task would be to use the official bitly API, which, however, has limitations (e.g., no more than 15 shortUrl parameters per request).
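If you take that route, the 15-URL cap means the list has to be processed in chunks. Here is a minimal sketch, assuming the legacy v3 /v3/expand endpoint and its response layout (ACCESS_TOKEN is a placeholder; check all of this against the current bitly documentation):

    import requests

    # Assumed v3 endpoint; ACCESS_TOKEN is a placeholder for your own token.
    API = "https://api-ssl.bitly.com/v3/expand"
    ACCESS_TOKEN = "..."

    l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

    # The endpoint accepts at most 15 shortUrl parameters per request,
    # so walk the list in chunks of 15.
    for start in range(0, len(l), 15):
        params = [('access_token', ACCESS_TOKEN)]
        params += [('shortUrl', 'http://' + u) for u in l[start:start + 15]]
        reply = requests.get(API, params=params).json()
        for entry in reply['data']['expand']:
            print entry.get('long_url', entry.get('error'))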
Second method
As an alternative, one could avoid fetching the content altogether, e.g. by using the HEAD HTTP method instead of GET. Here is a sample, which makes use of the excellent requests package:
    import requests

    l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc', ......., 'bit.ly/1bdDlXc']

    for i in l:
        print requests.head("http://" + i).headers['location']
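The loop above still issues the requests one at a time, so most of the run is spent waiting on the network. Here is a minimal sketch of a concurrent variant, assuming a thread pool of 20 workers (an arbitrary size to tune) via the standard-library multiprocessing.dummy module:

    import requests
    from multiprocessing.dummy import Pool  # a thread pool with the Pool API

    l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

    def expand(short):
        try:
            # HEAD without following the redirect: the Location header
            # carries the full URL (None if the server did not redirect).
            return requests.head("http://" + short).headers.get('location')
        except requests.RequestException as e:
            return str(e)

    pool = Pool(20)  # 20 worker threads; tune to taste
    for full in pool.map(expand, l):
        print full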
I'd try Twisted's asynchronous web client. Be careful with this, though: it doesn't rate-limit at all.
    #!/usr/bin/python2.7
    from twisted.internet import reactor
    from twisted.internet.defer import DeferredList, DeferredLock, inlineCallbacks
    from twisted.web.client import Agent, HTTPConnectionPool
    from pprint import pprint
    from collections import defaultdict
    from urlparse import urlparse
    from random import randrange
    import fileinput

    pool = HTTPConnectionPool(reactor)
    pool.maxPersistentPerHost = 16

    agent = Agent(reactor, pool)

    locks = defaultdict(DeferredLock)
    locations = {}

    def getLock(url, simultaneous=1):
        return locks[urlparse(url).netloc, randrange(simultaneous)]

    @inlineCallbacks
    def getMapping(url):
        # Limit ourselves to 4 simultaneous connections per host.
        # Tweak this as desired, but make sure it is no larger than
        # pool.maxPersistentPerHost.
        lock = getLock(url, 4)
        yield lock.acquire()
        try:
            resp = yield agent.request('HEAD', url)
            locations[url] = resp.headers.getRawHeaders('location', [None])[0]
        except Exception as e:
            locations[url] = str(e)
        finally:
            lock.release()

    dl = DeferredList([getMapping(url.strip()) for url in fileinput.input()])
    dl.addCallback(lambda _: reactor.stop())

    reactor.run()
    pprint(locations)
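The script reads its input via fileinput, so you can run it as, say, python expand.py urls.txt, or pipe URLs in on stdin. Note that, unlike the requests example above, each line must be a full URL including the http:// scheme, since Agent.request expects an absolute URI.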