
Selenium HtmlUnitDriver Web Scrape Got Captcha Page From EC2 Server

I wrote a simple web scraper for expedia.com. Using Java Selenium with HtmlUnitDriver, I can scrape data from the site successfully when I run it locally.

However, when I deploy it to an EC2 server, it always gets back the page where Expedia has detected it as a bot and shows a captcha to prove that a human is accessing the site.

I think it might have something to do with the IP addresses of EC2 servers being blacklisted by expedia.com somehow?

I've tried scraping other websites that don't care about bots or don't run a human test.

Any idea how to get around this?

Things I tried, but was still detected as a bot:

  • Changing the user agent to the one I use in my local browser
  • Setting a proxy

Update: Actually setting a proxy server gives me a different error:

Current URL is https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1

The htmlString:

<!--?xml version="1.0" encoding="ISO-8859-1"?-->
<html>
 <head> 
  <title>
      500 Internal Server Error
    </title> 
 </head> 
 <body> 
  <h1> Internal Server Error </h1> 
  <p> The server encountered an internal error or misconfiguration and was unable to complete your request. </p> 
  <p> Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error. </p> 
  <p> More information about this error may be available in the server error log. </p> 
  <hr> 
  <address> Apache/2.4.18 (Ubuntu) Server at www.expedia.com Port 443 </address>   
 </body>
</html>
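
A check like the following could flag this kind of response before any parsing is attempted. It is only a minimal sketch, assuming the page source is available as a string (for example from Selenium's `driver.getPageSource()`). The class and method names are hypothetical, and matching on the title text "Internal Server Error" is an assumption based solely on the response shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that inspects raw page source before scraping it. */
final class ResponseCheck {

    // Captures the text between <title> and </title>, across line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private ResponseCheck() {}

    /** Returns the trimmed page title, or empty if the page has no title element. */
    static Optional<String> title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    /** Assumption: an error page is recognised by its title text alone. */
    static boolean looksLikeServerError(String html) {
        return title(html)
                .map(t -> t.contains("Internal Server Error"))
                .orElse(false);
    }

    public static void main(String[] args) {
        // Shortened copy of the response quoted in the question.
        String html = "<html>\n <head>\n  <title>\n      500 Internal Server Error\n    </title>\n"
                + " </head>\n <body><h1> Internal Server Error </h1></body>\n</html>";
        System.out.println("Title: " + title(html).orElse("(none)"));
        System.out.println("Server error page: " + looksLikeServerError(html));
    }
}
```

A captcha page could be handled the same way, but the marker text to look for depends on what the site actually returns and is not shown here.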
user1955934 asked Aug 01 '18 13:08



1 Answer

Have you covered these points?

-Which user agent are you using? Make sure it is the same one a browser would send during human navigation.

-Are you inserting waits in your navigation? If you click or navigate as soon as a page loads, you are not simulating regular navigation.

-Which driver are you using? There is a trick with chromedriver: rename the internal variable "cdc_" to something else, such as "aaa_". Then, if JavaScript code served by the site tries to detect this variable (cdc_), the detection will fail.

-There are more things to look into if you really need to avoid detection by the server:

-Is there a honeypot in place?
-Is your IP (the EC2 IP) already blocked? You could route your traffic through a VPN tunnel.
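
The point about waits can be sketched as a small helper that pauses for a random interval between page actions. This is only an illustration: the class name, method names and the 200 to 600 ms range are hypothetical, and randomized pauses alone are not guaranteed to get past any particular bot detection.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: pauses for a random interval between page actions. */
final class HumanLikePause {

    private HumanLikePause() {}

    /** Returns a random delay in [minMillis, maxMillis], both inclusive. */
    static long nextDelayMillis(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("need 0 <= minMillis <= maxMillis");
        }
        // nextLong's upper bound is exclusive, so add 1 to include maxMillis.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /** Sleeps for a random delay in the given range and returns the delay used. */
    static long pause(long minMillis, long maxMillis) throws InterruptedException {
        long delay = nextDelayMillis(minMillis, maxMillis);
        Thread.sleep(delay);
        return delay;
    }

    public static void main(String[] args) throws InterruptedException {
        // In a scraper this would sit between loading a page and the next click or request.
        long waited = pause(200, 600);
        System.out.println("Paused for " + waited + " ms");
    }
}
```

In a Selenium scraper, a call such as `HumanLikePause.pause(200, 600)` would go between `driver.get(...)` and the next interaction with the page.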

Interesting articles:

https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html

https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html

https://intoli.com/blog/making-chrome-headless-undetectable/

bkemmer answered Sep 22 '22 11:09
