
Instagram user page parsing (with proxy, without API)

I need to parse an Instagram user page without the API and through a proxy, and I use code like the one below:

def client(options = {})
  Faraday.new('https://www.instagram.com', ssl: { verify: false }, request: { timeout: 10 }) do |conn|
    conn.request :url_encoded
    conn.proxy = options[:proxy] if options[:proxy] # only set a proxy when one is given
    conn.adapter :net_http
  end
end

response = client(proxy: URI('//111.111.111.111:8080')).get('some_username/')

response.status # 302
response['location'] # "https://www.instagram.com/accounts/login/"

But just a few days ago the code above worked as expected, i.e. it returned a 200 status and the user page in the body. Moreover, Faraday.get('https://www.instagram.com/some_username/') without a proxy still works fine, i.e. returns a 200 status and the user page in the body. I've also tried the same thing with other HTTP clients, and the result is the same: success without a proxy and a redirect with one.

Does the client need some additional specific configuration to work with a proxy, maybe?
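
In case it really is a configuration issue, here is a minimal sketch of the two ways Faraday accepts a proxy, including authenticated ones; the host, port and credentials below are placeholders, not values from the question.

require 'faraday'

# Proxy passed as a connection option; credentials go in the proxy URL.
conn = Faraday.new('https://www.instagram.com',
                   ssl: { verify: false },
                   request: { timeout: 10 },
                   proxy: 'http://user:password@111.111.111.111:8080') do |f|
  f.request :url_encoded
  f.adapter :net_http
end

# Or set it inside the builder block instead.
conn_alt = Faraday.new('https://www.instagram.com') do |f|
  f.proxy = 'http://user:password@111.111.111.111:8080'
  f.adapter :net_http
end

puts conn.get('some_username/').status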

UPDATE

I'm not sure, but it looks like a problem with the proxies themselves: Instagram somehow detects purchased/free proxies and redirects requests coming from them (I was using a purchased pack of proxies), because when I tried my own proxy it worked.
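
If that is what is happening, one way to confirm it is to run every proxy from the pack through the same request and see which ones still get a 200 instead of the login redirect. A rough sketch, with a placeholder proxy list:

require 'faraday'

# Placeholder list -- substitute the purchased proxies here.
proxies = ['http://111.111.111.111:8080', 'http://222.222.222.222:8080']

proxies.each do |proxy|
  conn = Faraday.new('https://www.instagram.com',
                     ssl: { verify: false },
                     request: { timeout: 10 },
                     proxy: proxy) do |f|
    f.adapter :net_http
  end

  response = conn.get('some_username/')
  # 200 = profile page returned; 302 = bounced to the login page, so the proxy is flagged.
  puts "#{proxy}: #{response.status}"
rescue Faraday::Error => e
  puts "#{proxy}: request failed (#{e.class})"
end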

asked Mar 04 '23 by O.Vykhor


1 Answer

Instagram made some changes lately. They most likely have some special AI, or use a service, that reviews your IP address: which ISP you use, whether it belongs to an organization like DigitalOcean, OVH, etc. or is residential, how many requests you make and to which endpoints, how you make them, how many accounts you use on it, how quickly you change them, and so on.

Right now, if you hit Instagram's scraping limits you will be redirected to LoginAndSignupPage (you can find it in the page source). Be aware that logging in at that point won't work - Instagram will just return a 429 error code, meaning too many requests. Also, after every such block your IP address most likely becomes even less trusted, so if you start scraping again after a block it will get blocked even faster.
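
That also means the block is detectable from the response itself, so a scraper can stop early instead of making the IP even less trusted. A rough sketch, assuming conn is a Faraday connection pointed at https://www.instagram.com like the one in the question:

response = conn.get('some_username/')

blocked = response.status == 429 ||      # already rate limited
          (response.status == 302 &&
           response.headers['location'].to_s.include?('/accounts/login/'))

if blocked
  # Continuing here only makes the IP less trusted; back off instead.
  puts 'Blocked by Instagram, stopping for now'
else
  html = response.body # profile page HTML
end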

I guess the easiest way is to just use a residential IP with a high enough delay between requests - say 3-5 seconds. It's even better if you can somehow use real accounts (without overusing them), and also make other kinds of requests in the meantime, like fetching some posts or opening a single post.
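
As a rough illustration of that pacing only (the 3-5 second delay and stopping at the first sign of a block), assuming a residential proxy and placeholder usernames:

require 'faraday'

usernames = ['some_username', 'another_username'] # placeholders

conn = Faraday.new('https://www.instagram.com',
                   request: { timeout: 10 },
                   proxy: 'http://your.residential.proxy:8080') do |f| # assumed residential proxy
  f.adapter :net_http
end

usernames.each do |name|
  response = conn.get("#{name}/")
  break unless response.status == 200 # a login redirect or 429 means we are blocked

  # ... parse response.body here; mixing in other requests (e.g. single posts) also helps ...

  sleep(3 + rand * 2) # wait 3-5 seconds between profile requests
end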

You can ignore pretty much any free proxy list you find on Google - 99% of the IPs on them are banned. It's almost the same with IPs from DigitalOcean, OVH, etc.: many of those are blocked as well.

answered Apr 02 '23 by Juri