Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Go through to original URL on social media management websites

I'm doing web scraping as part of an academic project, where it's important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links to detect who clicks on them.

For instance, consider this link on linkis.com, which links to http:// + bit.ly + /1P1xh9J (separated link due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The issue arises as the original link at linkis.com does not automatically redirect forward. Instead, the user has to click the cross in the top right corner to go to the original URL.

Furthermore, there seems to be different variations (see e.g. linkis.com link 2, where the cross is at the bottom left of the website). These are the only two variations I've found, but there might be more. Note that I'm using a web scraper very similar to this one. The functionality to go through to the actual link does not need to be stable/functioning over time as this is a one-time academic project.

How do I automatically go on to the original URL? Would the best approach be to design a regex that finds the relevant link?

like image 476
pir Avatar asked Jun 20 '17 16:06

pir


2 Answers

In many cases, you will have to use browser automation to scrape web pages that generate their content using javascript, scraping the html returned by the a get request will not yield the result you want, you have two options here :

  • Try to get your way around all the additional javascript requests to get the content you want which can be very time consuming .
  • Use browser automation, which lets you open a real browser and automates its tasks, you can use Selenium for that.

I have been developing bots and scrapers for years now, and unless the webpage you are requesting does not rely heavily on javascript, you should use something like selenium.

Here is some code to get you started with selenium:

from selenium import webdriver

#Create a chrome browser instance, other drivers are also available
driver = webdriver.Chrome()     

#Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')

#Select elements on the page and trigger events
#Selenium supports also xpath and css selectors
#Clicks the tag with the given id
driver.find_elements_by_id('some_id').click()
like image 110
SEDaradji Avatar answered Oct 02 '22 15:10

SEDaradji


The common architecture that the website follows is that it shows the website as an iframe. The sample code runs for both the cases.

In order to get the final URL you can do something like this:

import requests                                                                                                                                                                                        
from bs4 import BeautifulSoup                                                                                                                                                                          

urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]                                                                                                
response_data = []                                                                                                                                                                                     

for url in urls:                                                                                                                                                                                       
    response = requests.get(url)                                                                                                                                                                       
    soup = BeautifulSoup(response.text, 'html.parser')                                                                                                                                                 
    short_url = soup.find("iframe", {"id": "source_site"})['src']                                                                                                                                      
    response_data.append(requests.get(short_url).url)                                                                                                                                                  

print(response_data)
like image 35
Shobhit Sharma Avatar answered Oct 02 '22 16:10

Shobhit Sharma