 

How to avoid being banned while scraping data from a login based site?

I'm trying to create a script with which I can parse a few fields from a website without getting blocked. The site I wish to get data from requires credentials to access its content. If it were not for the login requirement, I could have bypassed the rate limit by rotating proxies.
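
For clarity, this is the sort of proxy rotation I have in mind (a minimal sketch; the proxy addresses and the fetch() helper are placeholders, not part of my actual script):

import random
import requests

# Hypothetical proxy pool -- these addresses are placeholders
PROXIES = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:8080",
]

def fetch(url):
    # Pick a different proxy for each request so traffic is spread across addresses
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

This doesn't help here, though, because all requests are tied to the same logged-in account regardless of which proxy they come from.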

Since I'm scraping content from a login-based site, I'm trying to figure out a way to avoid being banned while doing so. To be specific, my script can currently fetch content from that site flawlessly, but my IP address gets banned along the way if I keep scraping.

Here is what I've written so far (consider the following site address to be a placeholder):

import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

with requests.Session() as s:
    # Use a browser-like User-Agent for every request in the session
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    req = s.get(url)

    # Build the login payload, including the CSRF token ("fkey") taken from the login form
    payload = {
        "fkey": BeautifulSoup(req.text,"lxml").select_one("[name='fkey']")["value"],
        "email": "some email",
        "password": "some password",
    }

    # Log in, then scrape the question titles from the resulting page
    res = s.post(url,data=payload)
    soup = BeautifulSoup(res.text,"lxml")
    for post_title in soup.select(".summary > h3 > a.question-hyperlink"):
        print(post_title.text)

How can I avoid being banned while scraping data from a login based site?

asked Oct 20 '25 by SMTH

1 Answer

To get straight to the point of "any efficient way to avoid being banned": there is no foolproof way.

I would compare this situation to a shark attack: whether it happens is the shark's decision, not yours.

However, there are some techniques you can use to mitigate the "shark attack"... but first, let's be clear that you are the one "attacking" the shark, by swimming in its domain.

The technique is to create a "human" scraping script.

The word human here means behaving like a person who occasionally makes random, imperfect moves. Some examples are listed below (a code sketch of a few of them follows the list):

  • Insert some random delays between your tasks;
  • Click on some wrong link, wait a few seconds, go back;
  • Log out of the system, wait a minute or two, log back in;
  • If you have a list of links on a page to click and grab data from, don't go through them in order;
  • If the results are split across pages, don't fetch the pages in order (e.g. 1, 5, 2, 9, 10, 3, 7, 4, 8, 6);
  • Don't rush; grab a little data each day.
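
As a minimal sketch of the random-delay and out-of-order ideas, assuming the logged-in requests.Session `s` from the question and a placeholder list of result-page URLs:

import random
import time

# Hypothetical list of result pages to visit -- the URL pattern is a placeholder
page_urls = [f"https://example.com/questions?page={n}" for n in range(1, 11)]

# Visit the pages in a shuffled order instead of sequentially
random.shuffle(page_urls)

for url in page_urls:
    res = s.get(url)  # `s` is the logged-in requests.Session from the question
    # ... parse res.text here ...

    # Pause for a random, human-looking interval between requests
    time.sleep(random.uniform(5, 20))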

However, the most effective approach is to contact the website owner and propose a partnership, or pay for access to the data through an API or a similar service, if they offer one.

answered Oct 23 '25 by Paulo Marques
