I'm doing web scraping as part of an academic project, where it's important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links to detect who clicks on them.
For instance, consider this link on linkis.com, which links to http:// + bit.ly + /1P1xh9J (separated link due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The issue arises as the original link at linkis.com does not automatically redirect forward. Instead, the user has to click the cross in the top right corner to go to the original URL.
Furthermore, there seem to be different variations (see e.g. linkis.com link 2, where the cross is at the bottom left of the website). These are the only two variations I've found, but there might be more. Note that I'm using a web scraper very similar to this one. The functionality to go through to the actual link does not need to be stable over time, as this is a one-time academic project.
How do I automatically go on to the original URL? Would the best approach be to design a regex that finds the relevant link?
In many cases you will have to use browser automation to scrape web pages that generate their content with JavaScript; scraping the HTML returned by a GET request will not yield the result you want. You have two options here:
I have been developing bots and scrapers for years now, and unless the page you are requesting relies only lightly on JavaScript, you should use something like Selenium.
Here is some code to get you started with selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a Chrome browser instance; other drivers are also available
driver = webdriver.Chrome()
# Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')
# Select elements on the page and trigger events
# Selenium also supports XPath and CSS selectors
# Click the element with the given id ('some_id' is a placeholder)
driver.find_element(By.ID, 'some_id').click()
The common pattern on linkis.com is that the page embeds the target website in an iframe. The sample code works for both cases.
In order to get the final URL you can do something like this:
import requests
from bs4 import BeautifulSoup

urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]
response_data = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The wrapped page is loaded in an iframe with id "source_site"
    short_url = soup.find("iframe", {"id": "source_site"})['src']
    # requests follows redirects, so .url is the final destination
    response_data.append(requests.get(short_url).url)
print(response_data)
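One caveat with the loop above: if a page has no iframe with id source_site, `soup.find(...)` returns `None` and the subscript raises a `TypeError`. The following is a minimal sketch of a more defensive extraction step, written with only the standard library so it can be tried without a network connection. The names `IframeSrcFinder` and `extract_iframe_src` and the sample markup are hypothetical, made up for illustration; the real linkis.com markup may differ.

```python
from html.parser import HTMLParser
from typing import Optional


class IframeSrcFinder(HTMLParser):
    """Records the src of the first <iframe> whose id equals target_id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching iframe
        if tag != "iframe" or self.src is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == self.target_id:
            self.src = attr_map.get("src")


def extract_iframe_src(html: str, target_id: str = "source_site") -> Optional[str]:
    """Return the iframe's src, or None if no matching iframe is present."""
    finder = IframeSrcFinder(target_id)
    finder.feed(html)
    return finder.src


# Hypothetical markup, not copied from linkis.com
sample_html = (
    '<html><body>'
    '<iframe id="source_site" src="http://example.com/short"></iframe>'
    '</body></html>'
)
found = extract_iframe_src(sample_html)
missing = extract_iframe_src("<html><body>no frame here</body></html>")
print(found, missing)
```

In the scraper, a `None` result could be logged and skipped instead of aborting the whole run, which also makes it easier to spot further page variations.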