
How to deal with CAPTCHA when web scraping using R

I'm trying to scrape data from this website using httr and rvest. After roughly 90-100 requests, the website automatically redirects me to another URL with a CAPTCHA.

This is the normal URL: "https://fs.lianjia.com/ershoufang/pg1"

This is the CAPTCHA URL: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"

When my spider comes across the CAPTCHA URL, it tells me to stop and solve the CAPTCHA in the browser, which I do by hand. But when I run the spider again and send a GET request, it is still redirected to the CAPTCHA URL. Meanwhile, everything works normally in the browser; even if I type in the CAPTCHA URL, it redirects me back to the normal URL.

Even when I use a proxy, I still get the same problem: in the browser I can browse the website normally, while the spider keeps being redirected to the CAPTCHA URL.

I was wondering:

  1. Is my way of using the proxy correct?
  2. Why does the spider keep being redirected while the browser isn't, given that they share the same IP?

Thanks.

This is my code:

library(httr)
library(rvest)

# Send the GET request through a proxy, with browser-like headers
a <- GET(url, use_proxy(proxy, port), timeout(10),
         add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
                     'Connection' = 'keep-alive',
                     'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
                     'Accept-Encoding' = 'gzip, deflate, br',
                     'Host' = 'ajax.api.lianjia.com',
                     'Accept' = '*/*',
                     'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
                     'Cache-Control' = 'max-age=0'))

# Parse the response and extract the listing titles
b <- a %>% read_html() %>% html_nodes('div.leftContent') %>% html_nodes('div.info.clear') %>%
  html_nodes('div.title') %>% html_text()
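
The CAPTCHA detection is roughly like this (a sketch, not my exact code; httr follows redirects, so a$url holds the final URL of the response):

# Sketch: detect whether the request was redirected to the CAPTCHA page
if (grepl("captcha.lianjia.com", a$url, fixed = TRUE)) {
  stop("Hit the CAPTCHA page -- solve it in a browser before continuing")
}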

Finally, I turned to RSelenium. It's slow, but there are no more CAPTCHAs, and even when one appears I can solve it directly in the browser.
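
For reference, a minimal RSelenium sketch along those lines (the port, browser choice, and selectors are example values, not from the original post):

library(RSelenium)
library(rvest)

# Start a Selenium-driven browser (example settings; adjust browser/port as needed)
driver <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr  <- driver$client

# Navigate like a real user; CAPTCHAs can be solved by hand in this browser window
remDr$navigate("https://fs.lianjia.com/ershoufang/pg1")
Sys.sleep(5)  # give the page time to load

# Pull the rendered page source into rvest and extract the titles
page   <- read_html(remDr$getPageSource()[[1]])
titles <- page %>% html_nodes('div.info.clear div.title') %>% html_text()

# Clean up when done
remDr$close()
driver$server$stop()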

asked Oct 29 '22 by rankthefirst


1 Answer

You are getting CAPTCHAs because that is how the website tries to prevent non-human scripts from scraping its data. When you scrape, it detects you as a robotic script, most likely because your script sends GET requests very frequently, with the same parameter data each time. Your program needs to behave like a real user: visiting the website with random timing, different browsers, and different IPs.

You can avoid the CAPTCHA by varying these parameters, as below, so that your program appears like a real user (see the sketch after the list):

  1. Use randomness when sending GET requests. For example, use the Sys.sleep function (with a random draw) to pause before sending each GET request.

  2. Vary the user-agent data (Mozilla, Chrome, IE, etc.), cookie handling, and encoding.

  3. Vary your source location (IP address and server info), e.g. by rotating proxies.

Varying this information will help you avoid the CAPTCHA validation to some extent.
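
A minimal sketch of points 1-3 in R, assuming you supply your own pools of user agents and proxies (the values below are placeholders, not real endpoints):

library(httr)
library(rvest)

# Placeholder pools -- replace with your own values
user_agents <- c(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
)
proxies <- data.frame(host = c('proxy1.example.com', 'proxy2.example.com'),
                      port = c(8080, 8080), stringsAsFactors = FALSE)

scrape_page <- function(url) {
  # 1. Random pause so requests don't arrive at a fixed rate
  Sys.sleep(runif(1, min = 5, max = 15))

  # 2. Pick a random user agent for this request
  ua <- sample(user_agents, 1)

  # 3. Pick a random proxy for this request
  p <- proxies[sample(nrow(proxies), 1), ]

  GET(url,
      use_proxy(p$host, p$port),
      timeout(10),
      add_headers('User-Agent' = ua,
                  'Accept-Language' = 'en-GB,en;q=0.8',
                  'Accept-Encoding' = 'gzip, deflate, br'))
}

resp   <- scrape_page("https://fs.lianjia.com/ershoufang/pg1")
titles <- resp %>% read_html() %>%
  html_nodes('div.info.clear div.title') %>% html_text()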

answered Nov 15 '22 by Santosh M.