I have just scraped a bunch of Google Buzz data, and I want to know which Buzz posts reference the same news articles. The problem is that many of the links in these posts have been modified by URL shorteners, so it could be the case that many distinct shortened URLs actually all point to the same news article.
Given that I have millions of posts, what is the most efficient way (preferably in Python) for me to resolve the shortened URLs and determine which ones point to the same article?
Does anyone know if the URL shorteners impose strict request rate limits? If I keep this down to 100 requests/second (all coming from the same IP address), do you think I'll run into trouble?
UPDATE & PRELIMINARY SOLUTION: The responses have led to the following simple solution:
import urllib2
response = urllib2.urlopen("http://bit.ly/AoifeMcL_ID3") # Some shortened url
url_destination = response.url
That's it!
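Getting from resolved destinations to the actual goal (finding posts that reference the same article) is then a grouping step. A minimal sketch in Python 3, assuming the posts are available as (post_id, short_url) pairs and that `resolver` is any callable that maps a short URL to its destination, for example a small wrapper around the urlopen call above. The function name and the input shape are illustrative, not taken from the Buzz data:

```python
from collections import defaultdict


def group_posts_by_destination(posts, resolver):
    """Group post ids by the destination their short URL resolves to.

    posts    -- iterable of (post_id, short_url) pairs (assumed shape)
    resolver -- callable taking a short URL and returning its destination
    """
    groups = defaultdict(set)
    for post_id, short_url in posts:
        groups[resolver(short_url)].add(post_id)
    return dict(groups)
```

Any destination whose set holds more than one post id is an article referenced by several posts, regardless of which shortener each post used.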
To see the full destination of a shortened URL without visiting it, type the shortened URL in the address bar of your web browser with a small modification. For tinyurl.com, add "preview." between "https://" and "tinyurl", for example https://preview.tinyurl.com/SmallBizness. This safely displays the original URL without the need to go to the actual site.
The easiest way to get the destination of a shortened URL is with urllib. Given that the short URL is valid (response code 200), the destination URL will be returned to you.
>>> import urllib
>>> resp = urllib.urlopen('http://bit.ly/bcFOko')
>>> resp.getcode()
200
>>> resp.url
'http://mrdoob.com/lab/javascript/harmony/'
And that's that!
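The session above uses the Python 2 `urllib` module. In Python 3 the same call lives in `urllib.request`, and with millions of links it probably pays to handle dead or malformed ones instead of letting an exception stop the run. A hedged sketch; `expand_url` and its `opener` parameter are illustrative names, and returning None on failure is just one possible convention:

```python
import urllib.request


def expand_url(short_url, timeout=10, opener=urllib.request.urlopen):
    """Return the final URL after redirects, or None if the request fails."""
    try:
        # urlopen follows redirects; response.url is where they ended up.
        with opener(short_url, timeout=timeout) as response:
            return response.url
    except (OSError, ValueError):
        # OSError covers urllib.error.URLError, HTTPError (4xx/5xx) and
        # socket timeouts; ValueError is raised for malformed URLs such
        # as one with a missing scheme.
        return None
```

The `opener` argument only exists so that the function can be exercised without network access; in normal use the default is enough, e.g. `expand_url('http://bit.ly/bcFOko')`.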
(AFAIK) Most URL shorteners keep track of URLs already shortened, so several requests to the same service with the same URL will return the same short code.
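If that holds, a popular article will tend to appear under the same short URL many times, so each distinct short URL only needs to be resolved once. A sketch that drops duplicates first and paces the remaining requests; the default of 100 per second is simply the figure from the question, not a limit known to be safe, and all names here are illustrative:

```python
import time


def resolve_unique(short_urls, resolver, max_per_second=100,
                   clock=time.monotonic, sleep=time.sleep):
    """Resolve each distinct short URL once, at a capped request rate.

    resolver is any callable mapping a short URL to its destination.
    Returns a dict of short URL -> destination.
    """
    min_interval = 1.0 / max_per_second
    destinations = {}
    next_allowed = clock()
    # dict.fromkeys drops duplicates while keeping first-seen order.
    for short_url in dict.fromkeys(short_urls):
        now = clock()
        if now < next_allowed:
            sleep(next_allowed - now)
        next_allowed = max(now, next_allowed) + min_interval
        destinations[short_url] = resolver(short_url)
    return destinations
```

This version is sequential, so the real rate will usually be well below the cap; it bounds the request rate rather than guaranteeing it.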
As has been suggested, the best way to extract the real URL is to read the headers from a response to a request for the shortened URL. However, some shortening services (e.g. bit.ly) provide an API method to return the long URL.
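Reading only the headers, without downloading the article itself, can be sketched with `http.client` from the Python 3 standard library, which does not follow redirects on its own. This assumes the shortener answers a HEAD request with a 3xx status and a Location header, which is typical but not guaranteed. `redirect_target` and its parameters are illustrative names, and only the first hop is inspected:

```python
import http.client
from urllib.parse import urlsplit


def redirect_target(short_url, timeout=10, connection_cls=None):
    """Return the Location header of the first redirect, or None."""
    parts = urlsplit(short_url)
    if connection_cls is None:
        connection_cls = (http.client.HTTPSConnection
                          if parts.scheme == "https"
                          else http.client.HTTPConnection)
    conn = connection_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # HEAD asks for headers only, so no page body is transferred.
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in (301, 302, 303, 307, 308):
            return response.getheader("Location")
        return None
    finally:
        conn.close()
```

If the destination is itself another redirect (one shortener pointing at another), the function would have to be called again on the result, and a relative Location value would need to be joined with the original URL first.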