When I make requests and parse the pages with Beautiful Soup, I get blocked as a "bot".
import requests
from bs4 import BeautifulSoup
reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
Then I get messages from Reddit saying that they suspect I am a bot.
What are possible solutions through Beautiful Soup? (I have tried Scrapy in order to use its Crawlera, but due to my lack of Python knowledge, I cannot use it.) I don't mind if it is a paid service, as long as it is intuitive enough for a beginner to use.
To avoid this, you can use rotating proxies. A rotating proxy is a proxy server that assigns a new IP address from a pool of proxies for each connection. Rotating your IP addresses this way makes it harder for website owners to detect and block your scraper.
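For illustration, here is a minimal sketch of proxy rotation with requests; the addresses in PROXY_POOL are placeholders that you would replace with your own proxies or the endpoint of a paid rotating-proxy service:
import random
import requests

# Placeholder proxy addresses - replace with real proxies or a paid rotating-proxy endpoint.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def get_with_rotating_proxy(url):
    # Pick a different proxy from the pool for each request.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = get_with_rotating_proxy("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
print(response.status_code)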
One reason a request might be blocked is the User-Agent header: the Python requests library, for example, sends python-requests as its default user agent, so the website can tell the request comes from a script rather than a browser and may block it to protect itself from overload when many requests are sent.
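You can check the default User-Agent that requests sends like this (the exact version number depends on your install):
import requests

# Prints something like "python-requests/2.28.1", which sites can easily flag as a bot.
print(requests.utils.default_headers()["User-Agent"])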
There can be various reasons for being blocked as a bot.
As you are using the requests library "as is", the most probable reason for the block is a missing User-Agent header.
A common first line of defense against bots and scrapers is to check that the User-Agent header belongs to one of the major browsers and to block all non-browser user agents.
Short version: try this:
import requests
from bs4 import BeautifulSoup
# Start from requests' default headers and override the User-Agent with a real
# browser string so Reddit does not flag the request as coming from a bot.
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/", headers=headers)
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
Detailed explanation: Sending "User-agent" using Requests library in Python
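If the page still comes back blocked, it can also help to check the status code of the response; a 429 (Too Many Requests), for example, points to rate limiting rather than a user-agent problem:
print(reddit1Link.status_code)  # 200 means Reddit served the page normally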