I'm trying to scrape data from this website using httr and rvest. After a number of requests (around 90 to 100), the website automatically redirects me to another URL with a captcha.
This is the normal URL: "https://fs.lianjia.com/ershoufang/pg1"
This is the captcha URL: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"
When my spider comes across the captcha URL, it tells me to stop and solve the captcha in the browser, and I solve it there by hand. But when I run the spider again and send a GET request, the spider is still redirected to the captcha URL. Meanwhile, in the browser everything works normally: even if I type in the captcha URL, it redirects me back to the normal URL.
Even when I use a proxy, I still get the same problem. In the browser I can browse the website normally, while the spider keeps being redirected to the captcha URL.
I was wondering,
Thanks.
This is my code:
library(httr)   # GET, use_proxy, timeout, add_headers
library(rvest)  # read_html, html_nodes, html_text and the %>% pipe

# url, proxy and port are defined elsewhere in the script
a <- GET(url, use_proxy(proxy, port), timeout(10),
         add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
                     'Connection' = 'keep-alive',
                     'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
                     'Accept-Encoding' = 'gzip, deflate, br',
                     'Host' = 'ajax.api.lianjia.com',
                     'Accept' = '*/*',
                     'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
                     'Cache-Control' = 'max-age=0'))

# Extract the listing titles from the returned page
b <- a %>% read_html() %>% html_nodes('div.leftContent') %>% html_nodes('div.info.clear') %>%
  html_nodes('div.title') %>% html_text()
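The check for the captcha redirect described above can be sketched roughly as follows. This is only a minimal illustration, not the original spider: `is_captcha_url` and `fetch_page` are hypothetical names, and the sketch assumes httr follows the redirect so that the final address ends up in the response's `url` field.

```r
# Hypothetical helper: TRUE when a URL points at the captcha host.
is_captcha_url <- function(u) {
  grepl("^https?://captcha\\.lianjia\\.com/", u)
}

# Hypothetical wrapper around httr::GET that stops as soon as a request
# lands on the captcha page instead of the listing page.
fetch_page <- function(url, ...) {
  resp <- httr::GET(url, ...)
  # resp$url is the final URL after any redirects have been followed
  if (is_captcha_url(resp$url)) {
    stop("Redirected to captcha: ", resp$url)
  }
  resp
}
```

Calling `fetch_page(url, use_proxy(proxy, port), timeout(10))` in place of the plain `GET` call would then raise an error at the first captcha redirect rather than passing the captcha page on to the parsing step.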
Finally, I turned to RSelenium. It's slow, but there are no more captchas. Even when one appears, I can solve it directly in the browser.
You are getting CAPTCHAs because that is how the website tries to prevent non-human scripts from scraping its data. When you scrape the data, the site detects you as a robot. This happens because your script sends very frequent GET requests along with some parameter data. Your program needs to behave like a real user (visiting the website at random intervals, from different browsers and IP addresses).
You can avoid getting CAPTCHAs by manipulating the parameters below, so that your program appears to be a real user:
Use randomness when sending GET requests. For example, you can use the Sys.sleep function (with a delay drawn from a random distribution) to sleep before sending each GET request.
Manipulate the user agent data (Mozilla, Chrome, IE, etc.), cookie acceptance, and encoding.
Manipulate your source location (IP address and server info).
Manipulating this information will help you avoid CAPTCHA validation to some extent.
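As a rough sketch of the first two suggestions (random pauses and a varying user agent), something like the following could be used. The names `user_agents`, `random_delay` and `polite_get`, the list of user-agent strings, and the 2 to 8 second range are all assumptions chosen for illustration. This may reduce how often the captcha appears, but it is not guaranteed to prevent it.

```r
# A small illustrative pool of user-agent strings (assumed values).
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8"
)

# Hypothetical helper: a pause length in seconds, drawn uniformly at random.
random_delay <- function(min_s = 2, max_s = 8) {
  runif(1, min = min_s, max = max_s)
}

# Hypothetical helper: wait a random time, then send the GET request
# with a user agent picked at random from the pool above.
polite_get <- function(url, ...) {
  Sys.sleep(random_delay())
  httr::GET(url,
            httr::user_agent(sample(user_agents, 1)),
            httr::timeout(10),
            ...)
}
```

Extra arguments such as `use_proxy(proxy, port)` can be passed through `...`, so the proxy can also be changed from one request to the next.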