I'm trying to code a scraper in Python to get some info from a page. Like the title of the offers that appear on this page:
https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585
By now I use this code :
import bs4
import requests
def extract_source(url):
source=requests.get(url).text
return source
def extract_data(source):
soup=bs4.BeautifulSoup(source)
names=soup.findAll('title')
for i in names:
print i
extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))
But when I execute this code, it gives me an error:
<titlee> Access Denied</titlee>
What can I do to solve this?
Easy Way To Solve 403 Forbidden Errors When Web Scraping To avoid getting detected we need to optimise our spiders to bypass anti-bot countermeasures by: Using Fake User Agents. Optimizing Request Headers. Using Proxies.
Instead of looking at the job site every day, you can use Python to help automate your job search's repetitive parts. Automated web scraping can be a solution to speed up the data collection process. You write your code once, and it will get the information you want many times and from many pages.
As was mentioned in comments, you need to specify allowable user-agent and pass it as headers
:
def extract_source(url):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
source=requests.get(url, headers=headers).text
return source
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With