I'm trying to write a script that parses a few fields from a website without getting blocked. The site I want to get data from requires credentials to access its content. If it weren't for the login, I could have bypassed the rate limit by rotating proxies.
Since I'm scraping content from a login-based site, I'm trying to find a way to avoid being banned by that site while scraping. To be specific, my script can currently fetch content from the site without problems, but my IP address gets banned if I keep scraping.
Here is what I've written so far (treat the following site address as a placeholder):
import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    req = s.get(url)
    payload = {
        "fkey": BeautifulSoup(req.text, "lxml").select_one("[name='fkey']")["value"],
        "email": "some email",
        "password": "some password",
    }
    res = s.post(url, data=payload)
    soup = BeautifulSoup(res.text, "lxml")
    for post_title in soup.select(".summary > h3 > a.question-hyperlink"):
        print(post_title.text)
How can I avoid being banned while scraping data from a login based site?
Going straight to the point of "any efficient way to avoid being banned": there is no such way.
I would compare this situation with a shark attack. It was the shark's decision, not yours.
However, there are some techniques you can use to mitigate the "shark attack". But first, let's be clear that you are the one "attacking" the shark first, by swimming in its domain.
The technique would be to create a "human" scraping script.
The word "human" here means that the script sometimes makes random mistakes, as a real visitor would. Some of them are listed below:
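As an illustration of the idea, here is a minimal sketch that assumes two human-like behaviors: visiting pages in a shuffled order and pausing for an irregular amount of time between requests, with an occasional longer break. The names `human_pause` and `fetch_slowly`, the timing values, and the choice to stop on HTTP 429 (Too Many Requests) are all assumptions made for this example; the target site may signal rate limiting differently, and none of this guarantees that you will not be banned.

```python
import random
import time


def human_pause(base=3.0, jitter=2.0, long_break_chance=0.1, rng=random):
    """Return a randomized wait time in seconds (all values are assumptions)."""
    delay = base + rng.uniform(0, jitter)
    # Occasionally take a longer break, as a person reading a page might.
    if rng.random() < long_break_chance:
        delay += rng.uniform(10, 30)
    return delay


def fetch_slowly(session, urls, sleep=time.sleep):
    """Fetch urls in shuffled order with irregular pauses between requests.

    `session` is expected to be an already logged-in requests.Session
    (or any object with a compatible .get method).
    """
    pages = {}
    order = list(urls)
    random.shuffle(order)  # avoid a perfectly sequential crawl pattern
    for url in order:
        res = session.get(url, timeout=30)
        if res.status_code == 429:
            # Assumed rate-limit signal: stop instead of hammering the site.
            break
        pages[url] = res.text
        sleep(human_pause())
    return pages
```

With the session from the question, this could be called as `fetch_slowly(s, list_of_page_urls)` after the login request has succeeded, where `list_of_page_urls` is a hypothetical list of pages to visit.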
However, the most effective way would be to contact the website owner and offer a partnership, or pay for access to the data through an API or a similar service, if they offer one.