
How to perform unauthenticated Instagram web scraping in response to recent private API changes?

Months ago, Instagram began rendering its public API inoperable by removing most features and refusing to accept new applications for most permission scopes. Further changes made this week constrict developer options even more.

Many of us have turned to Instagram's private web API to implement the functionality we previously had. One standout, ping/instagram_private_api, manages to rebuild most of the prior functionality. However, with the publicly announced changes this week, Instagram also made underlying changes to its private API, requiring magic variables, user agents, and MD5 hashing to make web scraping requests possible. This can be seen by following the recent releases on the previously linked git repository, and the exact changes needed to continue fetching data can be seen here.

These changes include:

  • Persisting the User Agent & CSRF token between requests.
  • Making an initial request to https://instagram.com/ to grab an rhx_gis magic key from the response body.
  • Setting the X-Instagram-GIS header, which is formed by magically concatenating the rhx_gis key and query variables before passing them through an MD5 hash.

Anything less than this will result in a 403 error. These changes have been implemented successfully in the above repository; however, my attempt in JS continues to fail. In the code below, I am attempting to fetch the first 9 posts from a user timeline. The query parameters that determine this are:

  • query_hash of 42323d64886122307be10013ad2dcc44 (fetch media from the user's timeline).
  • variables.id of any user ID as a string (the user to fetch media from).
  • variables.first, the number of posts to fetch, as an integer.

Previously, this request could be made without any of the above changes by simply GETting from https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%7B%22id%22%3A%225380311726%22%2C%22first%22%3A1%7D, as the URL was unprotected.
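As a minimal sketch of how the three parameters above combine into that URL, the helper below serializes the variables as JSON and percent-encodes them. The function name buildQueryUrl is hypothetical and not taken from the linked repository; only the query hash and URL come from the question.

```javascript
// Query hash quoted in the question (fetch media from a user's timeline).
const QUERY_HASH = '42323d64886122307be10013ad2dcc44';

// Hypothetical helper: build the GraphQL query URL for a user's timeline.
function buildQueryUrl(userId, first) {
    // variables.id is sent as a string, variables.first as an integer.
    const variables = JSON.stringify({ id: String(userId), first: first });
    return 'https://www.instagram.com/graphql/query/' +
        '?query_hash=' + QUERY_HASH +
        '&variables=' + encodeURIComponent(variables);
}

console.log(buildQueryUrl('5380311726', 1));
```

With the user ID and count from the URL quoted above, this should reproduce that URL exactly, which is a quick way to check the encoding before adding any signature logic.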

However, my attempt at implementing the functionality successfully written in the above repository is not working, and I only receive 403 responses from Instagram. I'm using superagent as my requests library, in a Node environment.

const crypto = require('crypto');
const superagent = require('superagent');

/*
** Retrieve an arbitrary cookie value by a given key.
*/
const getCookieValueFromKey = function(key, cookies) {
    const cookie = cookies.find(c => c.indexOf(key) !== -1);
    if (!cookie) {
        throw new Error('No key found.');
    }
    return (RegExp(key + '=(.*?);', 'g').exec(cookie))[1];
};

/*
** Calculate the value of the X-Instagram-GIS header by md5 hashing together
** the rhx_gis variable and the query variables for the request.
*/
const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

/*
** Begin (wrapped in an async function so that await is valid)
*/
(async () => {
    const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5';

    // Make an initial request to get the rhx_gis string
    const initResponse = await superagent.get('https://www.instagram.com/');
    const rhxGis = (RegExp('"rhx_gis":"([a-f0-9]{32})"', 'g')).exec(initResponse.text)[1];

    const csrfTokenCookie = getCookieValueFromKey('csrftoken', initResponse.header['set-cookie']);

    const queryVariables = JSON.stringify({
        id: "123456789",
        first: 9
    });

    const signature = generateRequestSignature(rhxGis, queryVariables);

    const res = await superagent.get('https://www.instagram.com/graphql/query/')
        .query({
            query_hash: '42323d64886122307be10013ad2dcc44',
            variables: queryVariables
        })
        .set({
            'User-Agent': userAgent,
            'X-Instagram-GIS': signature,
            'Cookie': `rur=FRC;csrftoken=${csrfTokenCookie};ig_pr=1`
        });
})();

What else should I try? What makes my code fail while the code in the repository above works just fine?

Update (2018-04-17)

For at least the 3rd time in a week, Instagram has again updated their API. The change no longer requires the CSRF Token to form part of the hashed signature.

The question above has been updated to reflect this.

Update (2018-04-14)

Instagram has again updated their private graphql API. As far as anyone can figure out:

  • The User Agent no longer needs to be included in the X-Instagram-GIS md5 calculation.

The question above has been updated to reflect this.

Asked by marked-down on Apr 12 '18 at 02:04



2 Answers

Values to persist

You aren't persisting the User Agent (a requirement) in the first query to Instagram:

const initResponse = await superagent.get('https://www.instagram.com/'); 

Should be:

const initResponse = await superagent.get('https://www.instagram.com/')
    .set('User-Agent', userAgent);

This must be persisted in each request, along with the csrftoken cookie.
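One way to keep these values consistent is to build the shared headers once and reuse them for every request. The sketch below is only an illustration: cookieValue and sessionHeaders are hypothetical helper names, the Set-Cookie array is made up rather than a real Instagram response, and it assumes the csrftoken cookie is the only cookie that needs to be sent back.

```javascript
// Hypothetical helper: pull one cookie value out of a Set-Cookie header array.
function cookieValue(setCookieHeaders, name) {
    for (const header of setCookieHeaders) {
        // Each header looks like "name=value; Path=/; ..."; the first pair is the cookie.
        const pair = header.split(';')[0];
        const eq = pair.indexOf('=');
        if (eq !== -1 && pair.slice(0, eq).trim() === name) {
            return pair.slice(eq + 1).trim();
        }
    }
    throw new Error('Cookie "' + name + '" not found.');
}

// Hypothetical helper: headers that stay the same for every request in a session.
function sessionHeaders(userAgent, setCookieHeaders) {
    const csrfToken = cookieValue(setCookieHeaders, 'csrftoken');
    return {
        'User-Agent': userAgent,
        // Assumption: only the csrftoken cookie needs to be echoed back.
        'Cookie': 'csrftoken=' + csrfToken
    };
}

// Made-up Set-Cookie values for illustration only.
const sampleSetCookie = [
    'rur=FRC; Path=/',
    'csrftoken=abc123; expires=Thu, 11-Apr-2019 02:04:00 GMT; Path=/; Secure'
];
console.log(sessionHeaders('ExampleAgent/1.0', sampleSetCookie));
```

The same userAgent string would then be passed to the initial request and to every later request, for example via superagent's .set(headers).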

X-Instagram-GIS header generation

As your question shows, you must generate the X-Instagram-GIS header from two values: the rhx_gis value found in the response to your initial request, and the query variables of your next request. These must be md5 hashed, as shown in your function above:

const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

Answered by Alex on Sep 19 '22 at 11:09
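A detail that is easy to get wrong is that the string being hashed should presumably be byte-for-byte the same JSON string that is sent as the variables parameter. The sketch below serializes the variables once and returns both values together; signQuery is a hypothetical name and the rhx_gis value is a placeholder, not a real key.

```javascript
const crypto = require('crypto');

// Hypothetical helper: serialize the variables once and reuse that exact
// string for both the `variables` query parameter and the MD5 input.
function signQuery(rhxGis, variablesObject) {
    const variables = JSON.stringify(variablesObject);
    const signature = crypto
        .createHash('md5')
        .update(rhxGis + ':' + variables, 'utf8')
        .digest('hex');
    return { variables: variables, signature: signature };
}

// Placeholder rhx_gis value for illustration only.
const signed = signQuery('0123456789abcdef0123456789abcdef', { id: '123456789', first: 9 });
console.log(signed.variables);  // the string to send as `variables`
console.log(signed.signature);  // the value for the X-Instagram-GIS header
```

Returning the pair from one function avoids re-serializing the object later with different key order or whitespace, which would produce a different hash.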


To call an Instagram query, you need to generate the x-instagram-gis header.

To generate this header, calculate an MD5 hash of the string "{rhx_gis}:{path}". The rhx_gis value is stored in the source of the Instagram page, in the window._sharedData global JS variable.
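A rough sketch of reading rhx_gis out of the page source follows. It assumes the page embeds window._sharedData as a JSON object literal followed by ";</script>" and that rhx_gis is a top-level key of that object; the sample HTML is made up, and extractRhxGis is a hypothetical name.

```javascript
// Hypothetical helper: extract rhx_gis from the page HTML by parsing the
// JSON assigned to window._sharedData.
function extractRhxGis(html) {
    // Assumption: the object literal is followed directly by ";</script>".
    const match = /window\._sharedData\s*=\s*(\{[\s\S]*?\});\s*<\/script>/.exec(html);
    if (!match) {
        throw new Error('window._sharedData not found in page source.');
    }
    const sharedData = JSON.parse(match[1]);
    if (typeof sharedData.rhx_gis !== 'string') {
        throw new Error('rhx_gis not present in window._sharedData.');
    }
    return sharedData.rhx_gis;
}

// Made-up page fragment for illustration only.
const sampleHtml =
    '<script type="text/javascript">window._sharedData = ' +
    '{"config":{"csrf_token":"abc"},"rhx_gis":"0123456789abcdef0123456789abcdef"};</script>';
console.log(extractRhxGis(sampleHtml));
```

Parsing the JSON rather than matching the key with a regular expression makes the failure mode explicit when the page layout changes.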

Example:
If you make a GET request for user info, such as https://www.instagram.com/{username}/?__a=1,
you need to add the HTTP header x-instagram-gis to the request, with the value
MD5("{rhx_gis}:/{username}/")

This is tested and works 100%, so feel free to ask if something goes wrong.
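The path-based variant described in this answer could be sketched as follows. The names signPath and profileRequest are hypothetical, the rhx_gis value is a placeholder, and the sketch only builds the URL and header without sending anything.

```javascript
const crypto = require('crypto');

// Hypothetical helper: MD5("{rhx_gis}:/{username}/") as a hex string.
function signPath(rhxGis, username) {
    const path = '/' + username + '/';
    return crypto.createHash('md5').update(rhxGis + ':' + path, 'utf8').digest('hex');
}

// Hypothetical helper: the URL and header for a user info request.
function profileRequest(rhxGis, username) {
    return {
        url: 'https://www.instagram.com/' + username + '/?__a=1',
        headers: { 'x-instagram-gis': signPath(rhxGis, username) }
    };
}

// Placeholder rhx_gis and username for illustration only.
console.log(profileRequest('0123456789abcdef0123456789abcdef', 'exampleuser'));
```

Note that the signed path excludes the ?__a=1 query string, following the MD5 input given in the example above.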

Answered by olllejik on Sep 18 '22 at 11:09