
Get full URL from shortened URL using Python

Tags:

python

I have a list of URLs like:

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']

I just want to resolve the full URL behind each short one, for every element in that list.

Here is my approach:

import urllib2

for i in l:
    # urlopen follows the redirect chain; .url is the final, expanded address
    print urllib2.urlopen(i).url

But when the list contains thousands of URLs, the program takes a long time.

My question: is there any way to reduce the execution time, or another approach I should follow?

asked Aug 11 '14 by Nishant Nawarkhede

2 Answers

First method

As suggested, one way to accomplish the task would be to use the official bitly API, which, however, has limitations (e.g., no more than 15 shortUrls per request).
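A minimal sketch of that first method follows. It assumes the bitly v3 "expand" endpoint and an OAuth access token; BITLY_TOKEN is a placeholder, and the endpoint, parameter names and response fields should be checked against the current bitly documentation:

import requests

BITLY_TOKEN = 'YOUR_ACCESS_TOKEN'   # placeholder: a bitly OAuth access token

l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

def expand_batch(short_urls):
    # The v3 expand endpoint is (to my knowledge) limited to 15 shortUrl
    # parameters per request, hence the chunking below
    params = [('access_token', BITLY_TOKEN)]
    params += [('shortUrl', 'http://' + u) for u in short_urls]
    resp = requests.get('https://api-ssl.bitly.com/v3/expand', params=params)
    return resp.json()['data']['expand']

for start in range(0, len(l), 15):
    for entry in expand_batch(l[start:start + 15]):
        print entry.get('short_url'), '->', entry.get('long_url')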

Second method

As an alternative, one could simply avoid fetching the contents, e.g. by using the HEAD HTTP method instead of GET. Here is some sample code, which makes use of the excellent requests package:

import requests

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']

for i in l:
    # requests.head() does not follow redirects by default,
    # so the Location header holds the expanded URL
    print requests.head("http://"+i).headers['location']
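This goes beyond the answer above, but since the bottleneck with thousands of URLs is making the round trips one at a time, here is a hedged sketch of how the same HEAD trick could be parallelised with a thread pool (multiprocessing.dummy provides a thread-backed Pool in the standard library; the pool size of 20 and the 10-second timeout are arbitrary example values):

import requests
from multiprocessing.dummy import Pool  # thread pool, not processes

l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

def expand_one(short_url):
    try:
        # HEAD fetches only the headers; Location holds the target URL
        return requests.head('http://' + short_url, timeout=10).headers.get('location')
    except requests.RequestException as e:
        return str(e)

pool = Pool(20)   # 20 worker threads; tune to taste
results = pool.map(expand_one, l)
pool.close()
pool.join()

for short_url, long_url in zip(l, results):
    print short_url, '->', long_url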
answered by Roberto Reale

I'd try Twisted's asynchronous web client. Be careful with this, though; it doesn't rate-limit at all.

#!/usr/bin/python2.7

from twisted.internet import reactor
from twisted.internet.defer import Deferred, DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16
agent = Agent(reactor, pool)
locks = defaultdict(DeferredLock)
locations = {}

def getLock(url, simultaneous = 1):
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host
    # Tweak this as desired, but make sure that it no larger than
    # pool.maxPersistentPerHost
    lock = getLock(url,4)
    yield lock.acquire()
    try:
        resp = yield agent.request('HEAD', url)
        locations[url] = resp.headers.getRawHeaders('location',[None])[0]
    except Exception as e:
        locations[url] = str(e)
    finally:
        lock.release()


dl = DeferredList(getMapping(url.strip()) for url in fileinput.input())
dl.addCallback(lambda _: reactor.stop())

reactor.run()
pprint(locations)
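Note that fileinput.input() reads one URL per line from the files named on the command line (or from standard input), and Agent.request expects an absolute URI, so each line needs to include the http:// scheme, unlike the bare bit.ly/1bdDlXc entries in the question's list.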
answered by Robᵩ