I am trying to extract the website URL of a company/hotel from Tripadvisor.com pages. I don't see the website URL when I inspect the page. Any idea how I can extract it using Python? Apologies in advance, as I have only recently started web scraping in Python. Thank you.
E.g. please see the two red arrows in the screenshot below. When I select the website link it takes me to 'http://www.i-love-my-india.com/', which is what I want to extract using Python.
[Screenshot: Tripadvisor listing page]
Try this one using Selenium:
import time
from selenium import webdriver

browser = webdriver.Firefox(executable_path="C:\\Users\\Vader\\geckodriver.exe")
# You must install geckodriver (the driver for your browser) - see the instructions at
# http://selenium-python.readthedocs.io/installation.html.
# Change the path above to wherever your geckodriver file is.

browser.get('https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html')

# Clicking the "Website" entry opens the target site in a second tab.
browser.find_element_by_css_selector('.blEntry.website').click()
# browser.window_handles  # now holds two tabs

browser.switch_to.window(browser.window_handles[1])  # switch to the second tab

time.sleep(1)  # Going to the new tab immediately gave me a blank result, so I added
               # a small delay and it worked (I really do not know why).

res = browser.current_url  # the URL we want
print(res)

browser.quit()  # closes the browser
If you take a look at the element, you'll notice that the redirect URL is there (in the data-ahref attribute), but it's encoded and is decoded somewhere in the JS sources. Unfortunately, those sources are minified and obfuscated, so finding the decoder function will be hard. You thus have two options:
The first option is what Roberval _T_ suggested in his answer: click on the element, wait a little for the page to load in another tab, and grab the URL. This is a perfectly valid answer that deserves an upvote in my opinion. However, here's a little technique I always try when the desired data is unavailable for some reason: pretend to be a mobile device.
The obvious advantage of scraping the mobile pages is that they are more lightweight than the desktop ones. But often the mobile site also exposes data that the desktop version hides for some reason. In this case, all the info (address, homepage, phone) is present in the mobile version and can be grabbed immediately, without following the link explicitly. Here's how the page looks when I run Selenium with a mobile user agent:
Example code using an iPhone user agent:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html'

# Pretend to be an iPhone so TripAdvisor serves the mobile page.
chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1')

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(url)

# On the mobile page the website is shown directly in the contact section.
element = driver.find_element_by_css_selector('div.website.contact_link')
link = element.text
driver.quit()

print(link)
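For reference, here is a minimal sketch (my own addition, not part of the answer above) of how you could read the raw data-ahref attribute mentioned earlier on the desktop page. The '.blEntry.website' selector is borrowed from the first answer, and the value you get back is still encoded, so you would need TripAdvisor's own JS to decode it:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html')

# '.blEntry.website' is the selector from the first answer; the attribute holds
# the encoded redirect target, not the final URL.
element = driver.find_element_by_css_selector('.blEntry.website')
print(element.get_attribute('data-ahref'))

driver.quit()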
I would recommend using Selenium.
My answer can be seen as an improvement on what @Roberval T suggested. I consider his answer very good for this particular case.
This is my solution. I will point out some of the differences and why I think you should consider them:
import sys

# Selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

# I would use argparse, for example
try:
    assert len(sys.argv) == 2
    url = sys.argv[1]
except AssertionError:
    # Invalid arguments
    sys.exit()

# Set up the driver
driver = webdriver.Chrome()
driver.get(url)

# Try to load the page and wait until it's loaded
try:
    poll_frequency = 5
    data_section_id = "taplc_location_detail_header_attractions_0"
    data_section = WebDriverWait(driver, poll_frequency).until(
        EC.presence_of_element_located((By.ID, data_section_id)))
except TimeoutException:
    # Could not load the page
    sys.exit()

# Get the third child (relative to the data section div that we get by ID)
try:
    third_child = data_section.find_elements_by_xpath("./*")[2]
except IndexError:
    sys.exit()

# Get the child immediately under that (that's how the structure looks)
container_div = third_child.find_elements_by_xpath("./*")[0]
clickable_element = container_div.find_elements_by_xpath("./*")[3]

# Click the node
clickable_element.click()

# Switch tabs
driver.switch_to.window(driver.window_handles[1])

try:
    new_page = WebDriverWait(driver, poll_frequency).until(
        EC.presence_of_element_located((By.TAG_NAME, "body")))
except TimeoutException:
    sys.exit()

print(driver.current_url)
assert driver.current_url == "http://www.i-love-my-india.com/"

driver.quit()
First, in my opinion, you should use Selenium's specific waiting mechanisms instead of time.sleep(). This will let you fine-tune your scraper and also make it more reliable. I would suggest you look into WebDriverWait.
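For instance, here is a minimal sketch of waiting for the website link to become clickable instead of sleeping. The '.blEntry.website' selector is borrowed from the first answer, and 'driver' is assumed to already exist and have the page loaded:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the element to become clickable, instead of sleeping.
website_link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '.blEntry.website')))
website_link.click()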
Second, my personal preference is to avoid using class selectors. I am not saying they are wrong, but experience has shown me that they can change easily, and often the same class is used in multiple places (that's why it's a class). In this particular case, selecting by the CSS class works because the class is used in a single place, but what happens if, in the next release, the same class is used somewhere else? Following the structure is no guarantee either, but it is probably less likely to change.
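As a rough illustration of the two approaches (a sketch only, reusing the selector and ID from the snippets above, with 'driver' assumed to exist):
# Class-based selection: short, but breaks if the class is renamed or reused elsewhere.
element = driver.find_element_by_css_selector('.blEntry.website')

# Structure-based selection: anchor on a stable ID, then walk the children,
# as in the script above.
section = driver.find_element_by_id("taplc_location_detail_header_attractions_0")
third_child = section.find_elements_by_xpath("./*")[2]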
Third, use Chrome. Since version 59, Google Chrome has a headless option. In my opinion it is much easier to work with than Firefox. Going with Firefox will require you to install and run an X server on the production machine and connect the Firefox instance to that server through geckodriver. You skip all of that with Chrome.
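For completeness, a minimal sketch of enabling headless Chrome (my own addition, not part of the answer above):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # run Chrome without opening a window

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://www.tripadvisor.co.uk/')
print(driver.title)
driver.quit()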
I hope this helps!