Crawling Google play store

I am crawling the Google Play store, using Firefox with Firebug to inspect the requests and responses, but there is one parameter I don't understand. For example, when the URL "" loads the next page, it posts a parameter pagTok whose value is "EgIIKA==:S:ANO1ljJ4wWQ". Where does this value come from? Can anyone help?

wolfwolfgod asked Dec 20 '22 08:12


1 Answer

Investigation

Since Google recently changed its paging logic so that it now requires a token, I tried to work out how to either generate those tokens manually or scrape them out of the HTML returned with each response. So, let's get our hands dirty.

Using Fiddler2, I was able to isolate some token samples by looking at the request issued each time the Play Store loads another page of results.

Here's the whole request:

POST https://play.google.com/store/search?q=a&c=apps HTTP/1.1
Host: play.google.com
Connection: keep-alive
Content-Length: 123
Origin: https://play.google.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Accept: */*
X-Client-Data: CIe2yQEIpLbJAQiptskBCMG2yQEInobKAQjuiMoBCImSygE=
Referer: https://play.google.com/store/search?q=a&c=apps
Accept-Encoding: gzip, deflate
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2

** Post Body **
start=0&num=0&numChildren=0&pagTok=GAEiAggU%3AS%3AANO1ljLtUJw&ipf=1&xhr=1&token=bH2MlNeViIRJA8dT-zhaKrfNH7Q%3A1420660393029

Now that we know what the request looks like, the next step is to capture more requests and try to isolate the logic behind the token's formation.

Here are 3 request tokens I could find:

"GAEiAggU%3AS%3AANO1ljLtUJw", "GAEiAggo%3AS%3AANO1ljIeRQQ", "GAEiAgg8%3AS%3AANO1ljIM1CI"

Finding the Patterns

One thing our brains are really good at is finding patterns. Here's what mine found about how the tokens are formed:

1 - Starts with: "GAEiAg"

2 - Followed by: two characters that change from token to token

3 - Followed by: "%3AS%3A" (the URL-encoded form of ":S:")

4 - Followed by: 11 characters that also change from token to token
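
As a rough check of these observations, here is a small Python sketch that splits the three sample tokens on the ":S:" separator. The helper names (split_token, last_prefix_byte) are made up for this illustration. Treating the part before ":S:" as base64 is only an assumption; it happens to decode cleanly for these samples, and its last byte comes out as 20, 40 and 60, which looks like a result offset. That is a guess based on three tokens, not a documented format.

```python
import base64
from urllib.parse import unquote

# The three URL-encoded tokens captured above.
SAMPLES = [
    "GAEiAggU%3AS%3AANO1ljLtUJw",
    "GAEiAggo%3AS%3AANO1ljIeRQQ",
    "GAEiAgg8%3AS%3AANO1ljIM1CI",
]

def split_token(encoded):
    """URL-decode a pagTok and split it into (prefix, suffix) around ':S:'."""
    prefix, sep, suffix = unquote(encoded).partition(":S:")
    if not sep:
        raise ValueError("no ':S:' separator in token")
    return prefix, suffix

def last_prefix_byte(prefix):
    """Assumption: the prefix is base64. Return its last decoded byte."""
    return base64.b64decode(prefix)[-1]

PARTS = [split_token(token) for token in SAMPLES]

if __name__ == "__main__":
    for prefix, suffix in PARTS:
        print(prefix, len(suffix), last_prefix_byte(prefix))
```

The 11-character suffix does not show an obvious structure in these samples, which is why the rest of this answer scrapes the token instead of trying to generate it.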

Browser JavaScript Tricks vs. Manual HTTP Requests

Loading a page in a browser will, most of the time, not give you the same result as issuing the HTTP request manually from code. Why? Because of JavaScript.

Google is a heavy JS user, so it will use its own tricks to try to fool you.

If you look at the HTML, you will see no token that matches the pattern described above. Instead, you will find something like:

u0026c\\u003dapps\42,\42GAEiAghQ:S:ANO1ljLxWBY\42,\0420\42,\0420\42,\0420\42]\n

If you look carefully, you will see that your token is inside this "random string". All you have to do is replace ":S:" with "%3AS%3A", that is, URL-encode the two colons.
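
Rather than replacing the string by hand, the same result can be obtained with a standard URL-encoding call. A minimal Python sketch follows; encode_pag_tok is a name invented here, and the input is the token visible in the HTML fragment above.

```python
from urllib.parse import quote

def encode_pag_tok(raw_token):
    """Percent-encode a raw token as it would appear in the POST body."""
    # safe="" tells quote() to encode every reserved character, including ":".
    return quote(raw_token, safe="")

ENCODED = encode_pag_tok("GAEiAghQ:S:ANO1ljLxWBY")

if __name__ == "__main__":
    print(ENCODED)
```

This also covers tokens containing other reserved characters, such as the "=" signs in the token from the question.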

Regular Expressions For the Win

If you apply a regex to the page, you will be able to find the token, and then replace the ":S:" string with "%3AS%3A".

Here's the one I ended up using (powered by the best online regex builder):

Generated regular expression:

/GAEi+.+:S:.{11}\42/

Textual meaning of regular expression:

  • Match a string which contains the string GAE
  • followed by the character i 1 or more times
  • followed by any character 1 or more times
  • followed by the string :S:
  • followed by any character exactly 11 times
  • followed by the string \42
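
For anyone doing this in Python, here is a sketch of the extraction step run against the HTML fragment shown earlier. It uses a slightly stricter pattern than the generated one above, not the same expression: it refuses to cross a colon or a backslash, so it cannot run past the token, and it expects a literal backslash followed by 42 right after it. TOKEN_RE and extract_token are names chosen for this example.

```python
import re

# Stricter variant of the pattern above (an assumption, not the original regex):
# "GAEi", then characters up to ":S:", then 11 characters,
# followed by the literal text \42 that closes the string in the page source.
TOKEN_RE = re.compile(r'GAEi[^:\\]+:S:[^\\]{11}(?=\\42)')

# The fragment from the page source, kept as a raw string so that
# the backslashes stay literal.
SAMPLE_HTML = r'u0026c\\u003dapps\42,\42GAEiAghQ:S:ANO1ljLxWBY\42,\0420\42,\0420\42,\0420\42]\n'

def extract_token(html):
    """Return the raw token found in html, or None if there is none."""
    match = TOKEN_RE.search(html)
    return match.group(0) if match else None

TOKEN = extract_token(SAMPLE_HTML)

if __name__ == "__main__":
    print(TOKEN)
```

The pattern still anchors on the "GAEi" prefix, so a token that starts differently, like the one in the question, would need a looser prefix.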

TL;DR

The token is in the HTML, but it is "masked" by Google, which "unmasks" it using JavaScript (which you can only run if you drive a real browser, for example through Selenium).

To fetch the pagTok for the next page, read the current page's HTML, scrape the token out of it (logic above), use it in the next request, and repeat.
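
Here is a sketch of that loop in Python. The HTTP call itself is left to a caller-supplied fetch function (a hypothetical hook, so any HTTP client can be plugged in), and the body fields are copied from the captured request. The separate token field from that request is left out, because where it comes from was not investigated here; a real request may well need it.

```python
import re
from urllib.parse import urlencode

# Same assumed pattern as in the extraction step: token followed by a literal \42.
TOKEN_RE = re.compile(r'GAEi[^:\\]+:S:[^\\]{11}(?=\\42)')

def build_body(raw_token):
    """Build the form body for the next page; urlencode() escapes the colons."""
    return urlencode({
        "start": 0, "num": 0, "numChildren": 0,
        "pagTok": raw_token, "ipf": 1, "xhr": 1,
    })

def crawl(fetch, max_pages=10):
    """Collect pages by calling fetch(body) until no token is found.

    fetch is a hypothetical callable: it takes the POST body (None for the
    first page) and returns the response HTML as a string.
    """
    pages = []
    body = None
    for _ in range(max_pages):
        html = fetch(body)
        pages.append(html)
        match = TOKEN_RE.search(html)
        if match is None:
            break  # no token in this page, so there is nothing more to request
        body = build_body(match.group(0))
    return pages
```

Keeping the network call outside crawl makes the token handling easy to try against saved responses before pointing it at the live site.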

I hope it helps. Sorry about the wall of text; I wanted to be as clear as possible.

Marcello Grechi Lins answered May 21 '23 13:05