
Requests using Beautiful Soup get blocked

When I make requests using Beautiful Soup, I get blocked as a "bot".

import requests
from bs4 import BeautifulSoup

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)

Then I get a message from Reddit saying it suspects me of being a bot.

What are possible solutions when using Beautiful Soup? (I have tried Scrapy so that I could use its Crawlera service, but due to my limited Python knowledge I could not get it working.) I don't mind a paid service, as long as it is intuitive enough for a beginner to use.

asked Apr 16 '17 by CottonCandy

People also ask

How do I stop scraping when I get blocked?

One common way to avoid this is to use rotating proxies. A rotating proxy is a proxy server that assigns a new IP address from a pool of proxies for each request. Routing your requests through proxies and rotating the IP address makes it much harder for a website to detect and block your scraper.
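
As an illustration, here is a minimal sketch of rotating through a proxy pool with requests. The proxy addresses below are placeholders; real ones would come from whichever proxy provider you use.

import random
import requests

# Placeholder proxy pool - substitute addresses from your own proxy provider.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def get_with_rotating_proxy(url):
    # Pick a different proxy for each request so no single IP sends all the traffic.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)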

Why does scraping fail with requests and BeautifulSoup?

requests and BeautifulSoup only see the raw HTML returned by the server, which usually contains all the information on the page; CSS merely controls styling, so a scraping program doesn't care what the page looks like. When scraping fails anyway, it is typically because the server refuses or rate-limits non-browser clients, or because the content is filled in by JavaScript after the initial HTML is delivered.
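
As a quick sanity check, you can look at the HTTP status code before parsing; a 403 or 429 usually means the server refused or rate-limited the request rather than that the HTML was unparseable. A minimal sketch, using a placeholder URL:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.example.com/some-page")
if response.status_code == 200:
    # The request was served; parse the HTML as usual.
    soup = BeautifulSoup(response.content, "lxml")
    print(soup.title)
else:
    # 403 or 429 typically indicates the client was blocked or rate-limited.
    print("Request refused:", response.status_code)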

Can websites block Python requests?

One common reason a request gets blocked is the default User-Agent: the Python requests library identifies itself as python-requests, so a website can tell the request comes from a script rather than a browser and may block it to protect itself from overload or unwanted scraping.
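
You can see that default for yourself by printing the headers requests sends when you do not override them:

import requests

# The exact version number depends on the requests release you have installed.
print(requests.utils.default_headers()["User-Agent"])  # e.g. python-requests/2.28.1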

Is the Python requests library blocking?

Like urllib2, requests performs blocking I/O: each call waits until the response arrives before returning. Rather than switching to another library, the simplest way around this is to run each request in a separate thread.
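
A minimal sketch of that approach using the standard-library thread pool; the URLs are placeholders:

import requests
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
    "https://www.example.com/page3",
]

def fetch(url):
    # Each call still blocks, but only its own worker thread waits for the response.
    return requests.get(url, timeout=10)

with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(fetch, urls))

print([r.status_code for r in responses])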


1 Answer

There can be various reasons for being blocked as a bot.

Since you are using the requests library with its defaults, the most likely reason for the block is the missing User-Agent header.

A common first line of defense against bots and scrapers is to check whether the User-Agent header belongs to one of the major browsers and to block all non-browser user agents.

Short version: try this:

import requests
from bs4 import BeautifulSoup

# Start from the library's default headers and replace the User-Agent
# with one that looks like a regular desktop browser.
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

# Pass the headers with the request so Reddit sees a browser-like client.
reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/", headers=headers)
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)

Detailed explanation: Sending "User-agent" using Requests library in Python
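
If you plan to fetch more than one page, a requests.Session is a convenient variation on the same idea: set the User-Agent once and it is sent with every request, and the underlying connection is reused. A minimal sketch along those lines:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# The session applies this User-Agent to every request made through it.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

reddit1Link = session.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content.title)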

answered Sep 20 '22 by Done Data Solutions