 

How to scrape data from a website when linked to event clicks?

I am trying to scrape/extract the website of the company/hotel from Tripadvisor.com webpages. I don't see the website URL when I inspect the page. Any idea how I can extract the website URL using Python? Apologies in advance, as I have only recently started web scraping in Python. Thank you.

E.g., please see the two red arrows in the image. When I select the website link it takes me to http://www.i-love-my-india.com/ - which is what I want to extract using Python.

(Screenshot: Tripadvisor page with the website link highlighted.)

asked Feb 01 '18 by keepsmilinyaar

3 Answers

Try this one using Selenium:

import time
from selenium import webdriver

browser = webdriver.Firefox(executable_path="C:\\Users\\Vader\\geckodriver.exe")
# You must install geckodriver (it drives your browser) - see the
# instructions on http://selenium-python.readthedocs.io/installation.html.
# Change the path to wherever your geckodriver file is.

browser.get('https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html')
browser.find_element_by_css_selector('.blEntry.website').click()

#browser.window_handles # Result: 2 open tabs.

browser.switch_to.window(browser.window_handles[1]) # switch to the
                                                    # second tab

time.sleep(1) # Going directly gave a 'blank' result, so a short delay
              # was added and it worked (I am not sure why).

res = browser.current_url # the URL

print(res)

browser.quit() # Closes the browser


answered Nov 15 '22 by Roberval _T_


If you take a look at the element, you'll notice that the redirect URL is there (data-ahref attribute), but it's encoded and is decoded somewhere in the JS sources. Unfortunately, they are minified and obfuscated, so finding the decoder function will be hard. You thus have two options:

Follow redirects

This is what Roberval _T_ suggested in his answer: click the element, wait some time for the page to load in another tab, then grab the URL. This is a perfectly valid answer that deserves an upvote in my opinion; however, here's a little technique I always try when the desired data is unavailable for some reason:

Scrape the mobile webpage

The obvious advantage of scraping mobile pages is that they are more lightweight than the desktop ones. But often the mobile website also exposes data that the desktop version tries to hide for some reason. In this case, all the info (address, homepage, phone) in the mobile version can be grabbed immediately without loading the URL explicitly. Here's how the page looks when I run Selenium with a mobile user agent:

(Screenshot: mobile version of the Tripadvisor page showing address, website, and phone links.)

Example code using an iPhone user agent:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html'
chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1')
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(url)

element = driver.find_element_by_css_selector('div.website.contact_link')
link = element.text
driver.quit()
print(link)
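As an aside on the encoded data-ahref attribute: even without reverse-engineering the decoder, you can pull the raw value out of the markup to confirm it's there. A minimal sketch using Python's built-in HTML parser and a made-up snippet (the real attribute value and its encoding scheme are unknown):

```python
from html.parser import HTMLParser

# Hypothetical snippet mimicking the Tripadvisor markup; the real
# data-ahref value is an encoded string, not a plain URL.
snippet = '<div class="blEntry website" data-ahref="LqMWir_encoded_value">Website</div>'

class AhrefExtractor(HTMLParser):
    """Collects every data-ahref attribute seen while parsing."""
    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'data-ahref' in attrs:
            self.values.append(attrs['data-ahref'])

parser = AhrefExtractor()
parser.feed(snippet)
print(parser.values)  # the raw (still encoded) attribute values
```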

answered Nov 15 '22 by hoefling


I would recommend using selenium.

My answer can be seen as an improvement on what @Roberval T suggested. I consider his answer very good for this particular case.

This is my solution. I will point out some of the differences and why I think you should consider them:

import sys

# Selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException


# I would use argparse for example
try:
    assert len(sys.argv) == 2
    url = sys.argv[1]
except AssertionError:
    # Invalid arguments
    sys.exit()

# Set up the driver
driver = webdriver.Chrome()
driver.get(url)

# Load the page and wait until it's loaded
try:
    poll_frequency = 5
    data_section_id = "taplc_location_detail_header_attractions_0"
    data_section = WebDriverWait(driver, poll_frequency).until(EC.presence_of_element_located((By.ID, data_section_id)))
except TimeoutException:
    # Could not load page
    sys.exit()

# Get the third child ( relative to the data section div that we get by ID )
try:
    third_child = data_section.find_elements_by_xpath("./*")[2]
except IndexError:
    sys.exit()

# Get the child immediately under that (that's how the structure looks)
container_div = third_child.find_elements_by_xpath("./*")[0]
clickable_element = container_div.find_elements_by_xpath("./*")[3]

# Click the node
clickable_element.click()

# Switch tabs
driver.switch_to.window(driver.window_handles[1])

try:
    new_page = WebDriverWait(driver, poll_frequency).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
except TimeoutException:
    sys.exit()

print(driver.current_url)
assert driver.current_url == "http://www.i-love-my-india.com/"

driver.quit()

  1. First, in my opinion, you should use Selenium's specific waiting mechanisms instead of time.sleep(). This will allow you to fine-tune your scraper and will also make it more reliable. I would suggest you look into WebDriverWait.

  2. Second, my personal preference is to avoid using class selectors. I am not saying they are wrong, but experience has shown me they can change easily, and often the same class is used in multiple places (that's why it's a class). In this particular case, selecting by the CSS class works because that class is used in a single place.

    • What happens if in the next release, the same class is used in another place?

    • While following the structure is no guarantee either, it's probably less likely to change.

  3. Use Chrome. Since version 59, Google Chrome has a headless option. It is much easier to work with than Firefox, in my opinion. Going with Firefox will require you to install and run an X server on the production machine and connect the Firefox instance to that server through geckodriver. You skip all this with Chrome.


I hope this helps!

answered Nov 15 '22 by mayk93