I need to parse an Instagram user page without the API and through a proxy. I use code like this:
require 'faraday'

def client(options = {})
  Faraday.new('https://www.instagram.com', ssl: { verify: false }, request: { timeout: 10 }) do |conn|
    conn.request :url_encoded
    conn.proxy options[:proxy]
    conn.adapter :net_http
  end
end

# Pass the proxy as a client option so it is applied to the connection.
response = client(proxy: URI('//111.111.111.111:8080')).get('some_username/')
response.status # 302
response['location'] # "https://www.instagram.com/accounts/login/"
But just a few days ago, the code above worked as expected, i.e. it returned a 200 status and a body with the user page. Moreover, the code Faraday.get('https://www.instagram.com/some_username/') without a proxy works fine, i.e. it returns a 200 status and a body with the user page. I've also tried other clients, and the result is the same: success without a proxy and a redirect with one.
Maybe the client needs some additional specific configuration to work with a proxy?
UPDATE
I'm not sure, but it looks like a problem with the proxy, i.e. Instagram somehow detects purchased/free proxies and redirects requests coming from those proxies (I used a purchased pack of proxies), because when I tried my own proxy, it worked.
Instagram has made changes lately. They most likely have some special AI or use a service that reviews your IP address: which ISP you use, whether it belongs to an organization like DigitalOcean or OVH or is residential, how many requests you make and to which endpoints, how you make them, how many accounts you use on that IP, how quickly you switch between them, etc.
Right now, if you hit Instagram's scraping limits, you will be redirected to LoginAndSignupPage (you can find it in the source code). Be aware that logging in at this point won't work: Instagram will just return a 429 error code, meaning too many requests. Also, after every such block your IP address most likely becomes even less trusted, so if you start scraping again after a block, it will get blocked even faster.
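To notice such a block early and stop before the IP gets flagged further, the two symptoms described above (a redirect to the login page, or a 429) can be checked on every response. This is only a minimal sketch: `blocked?` and `LOGIN_PATH` are hypothetical names, and the exact set of redirect statuses Instagram uses is an assumption based on the 302 seen in the question.

```ruby
# Hypothetical helper, not part of Faraday or any Instagram API.
# Path taken from the "location" header shown in the question.
LOGIN_PATH = '/accounts/login/'

# Returns true when a response looks like a scraping block:
# either HTTP 429, or a redirect whose target is the login page.
# Assumption: blocks show up as 301/302 redirects, as in the question.
def blocked?(status, location = nil)
  return true if status == 429

  [301, 302].include?(status) && location.to_s.include?(LOGIN_PATH)
end

# With the Faraday response from the question, this could be called as:
#   blocked?(response.status, response['location'])
```

If it returns true, it is probably better to pause or switch to another IP than to retry right away.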
I guess the easiest way is to use a residential IP with a sufficiently long delay between requests, like 3-5 seconds. Even better if you can somehow use real accounts without overusing them, and also make other requests in the meantime, like getting some posts, opening a single post, or something similar.
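The delay between requests could look roughly like the sketch below. `each_with_delay` is a hypothetical helper, and the 3-5 second range is just the value suggested above, randomized so the requests are not evenly spaced. The `sleeper` parameter is there only so the pacing can be replaced in tests.

```ruby
# Hypothetical helper: yields each item and waits a random 3-5 seconds
# between consecutive items (no wait before the first one).
def each_with_delay(items, range: 3.0..5.0, sleeper: ->(seconds) { sleep(seconds) })
  items.each_with_index do |item, index|
    # rand with a Float range returns a Float within that range
    sleeper.call(rand(range)) if index.positive?
    yield item
  end
end

# Possible usage with the client from the question (usernames are placeholders):
#   each_with_delay(%w[some_username other_username]) do |name|
#     response = client(proxy: my_proxy).get("#{name}/")
#     # ... parse response.body ...
#   end
```

A fixed interval would also work, but a randomized one looks a bit less like an automated client.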
You can ignore pretty much any free proxy list available on Google: 99% of the IPs on them are banned. It is almost the same with IPs from DigitalOcean, OVH, etc.; many of them are blocked as well.