I wrote a simple web scraper for expedia.com using Java and Selenium's HtmlUnitDriver. When I run it locally, it scrapes data from the site successfully.
However, when I deploy it to an EC2 server, it always gets back the page where Expedia has detected it as a bot and displays a CAPTCHA to prove that a human is accessing the site.
I think it might have something to do with the IP addresses of EC2 servers, which may have been blacklisted by expedia.com.
I've also tried scraping other websites that don't care or don't run a human test.
Any idea how to get around this?
Things I tried that were still detected as a bot:
Update: Actually, setting a proxy server gives me a different error:
Current URL is https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1
The htmlString:
<!--?xml version="1.0" encoding="ISO-8859-1"?-->
<html>
<head>
<title>
500 Internal Server Error
</title>
</head>
<body>
<h1> Internal Server Error </h1>
<p> The server encountered an internal error or misconfiguration and was unable to complete your request. </p>
<p> Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error. </p>
<p> More information about this error may be available in the server error log. </p>
<hr>
<address> Apache/2.4.18 (Ubuntu) Server at www.expedia.com Port 443 </address>
</body>
</html>
Have you covered these points?
-Which user agent are you using? Make sure it is the same one a real browser sends during human navigation.
-Are you inserting waits in your navigation? If you click or navigate as soon as a page loads, you are not simulating regular browsing.
-Which driver are you using? There is a trick with chromedriver: rename the internal variable "cdc_" to something else, such as "aaa_". Then any JavaScript served by the site that tries to detect this variable (cdc_) will fail to find it.
-There are more things to look into if you really need to avoid being detected by the server:
-Is there a honeypot in place?
-Is your IP (the EC2 IP) already blocked? You could route your traffic through a VPN tunnel.
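The first two points (a browser-like user agent and waits between actions) can be sketched roughly as follows. This is only an illustration under assumptions: it uses the JDK's built-in java.net.http classes (Java 11+) rather than Selenium so that it runs without extra dependencies, the class and method names (PoliteFetch, buildRequest, randomDelayMillis) are made up for the example, and the User-Agent string is a sample value, not one taken from the question. With Selenium, the same ideas would be applied through the driver's own configuration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

class PoliteFetch {
    // Assumption: an example desktop Chrome User-Agent string. Copy the real
    // value from the browser you use by hand instead of relying on this one.
    static final String BROWSER_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds a GET request that carries browser-like headers.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", BROWSER_USER_AGENT)
                .header("Accept-Language", "en-US,en;q=0.9")
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    // Picks a delay in [minMillis, maxMillis] so pauses are not perfectly regular.
    static long randomDelayMillis(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("expected 0 <= min <= max");
        }
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    public static void main(String[] args) throws InterruptedException {
        HttpRequest request = buildRequest(
                "https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1");
        System.out.println("User-Agent: "
                + request.headers().firstValue("User-Agent").orElse("(none)"));

        long delay = randomDelayMillis(500, 1500);
        System.out.println("Pausing " + delay + " ms before the next action");
        Thread.sleep(delay);
        // No request is sent here; pass `request` to an HttpClient to send it.
    }
}
```

Sending the request from both your local machine and the EC2 instance and comparing the responses is one way to check whether the block depends on the IP address rather than on the driver. If the EC2 response is still the CAPTCHA page with identical headers, the IP range is the more likely cause.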
Interesting articles:
https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html
https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html
https://intoli.com/blog/making-chrome-headless-undetectable/