When I make requests and parse the pages with Beautiful Soup, I get blocked as a "bot".
import requests
from bs4 import BeautifulSoup
reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
Then I get messages from Reddit saying that they suspect I am a bot.
What are possible solutions through Beautiful Soup? (I have tried Scrapy in order to use its Crawlera, but due to my lack of Python knowledge, I cannot use it.) I don't mind if it is a paid service, as long as it is intuitive enough for a beginner to use.
To avoid this, you can use rotating proxies. A rotating proxy is a proxy server that assigns a new IP address from a pool of proxies for each connection. Rotating your IP addresses this way makes it harder for website owners to detect and block your scraper.
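For illustration, here is a minimal sketch of proxy rotation with requests; the addresses in PROXY_POOL are placeholders that you would replace with your own proxies or the endpoint of a paid rotating-proxy service:
import random
import requests

# Placeholder proxy addresses - replace with real proxies or a paid rotating-proxy endpoint.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def get_with_rotating_proxy(url):
    # Pick a different proxy from the pool for each request.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = get_with_rotating_proxy("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/")
print(response.status_code)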
One reason a request might be blocked is the User-Agent header: the Python requests library, for example, sends python-requests as its default user agent, so the website can tell the request comes from a script rather than a browser and may block it to protect itself from overload when many requests are sent.
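You can check the default User-Agent that requests sends like this (the exact version number depends on your install):
import requests

# Prints something like "python-requests/2.28.1", which sites can easily flag as a bot.
print(requests.utils.default_headers()["User-Agent"])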
There can be various reasons for being blocked as a bot.
As you are using the requests library "as is", the most probable reason for the block is a missing User-Agent header.
A common first line of defense against bots and scrapers is to check that the User-Agent header belongs to one of the major browsers and to block all non-browser user agents.
Short version: try this:
import requests
from bs4 import BeautifulSoup
# Start from requests' default headers and override the User-Agent with a real
# browser string so Reddit does not flag the request as coming from a bot.
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

reddit1Link = requests.get("https://www.reddit.com/r/tensorflow/comments/650p49/question_im_a_techy_35_year_old_and_i_think_ai_is/", headers=headers)
reddit1Content = BeautifulSoup(reddit1Link.content, "lxml")
print(reddit1Content)
Detailed explanation: Sending "User-agent" using Requests library in Python
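If the page still comes back blocked, it can also help to check the status code of the response; a 429 (Too Many Requests), for example, points to rate limiting rather than a user-agent problem:
print(reddit1Link.status_code)  # 200 means Reddit served the page normally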