I am crawling the Google app store, using Firefox with Firebug to inspect the requests and responses, but there is one parameter I don't understand. For example, for the URL "", loading the next page posts a parameter pagTok whose value is "EgIIKA==:S:ANO1ljJ4wWQ". Where does this value come from? Can anyone help?
Investigation
Google recently changed its paging logic so that it now requires a token, so I've been investigating how to either generate those tokens manually or scrape them out of the HTML returned with each response. So, let's get our hands dirty.
Using Fiddler2, I was able to isolate some token samples by looking at the requests issued for each "paging" of the Play Store.
Here's the whole request:
POST https://play.google.com/store/search?q=a&c=apps HTTP/1.1
Host: play.google.com
Connection: keep-alive
Content-Length: 123
Origin: https://play.google.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Accept: */*
X-Client-Data: CIe2yQEIpLbJAQiptskBCMG2yQEInobKAQjuiMoBCImSygE=
Referer: https://play.google.com/store/search?q=a&c=apps
Accept-Encoding: gzip, deflate
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2
** Post Body **
start=0&num=0&numChildren=0&pagTok=GAEiAggU%3AS%3AANO1ljLtUJw&ipf=1&xhr=1&token=bH2MlNeViIRJA8dT-zhaKrfNH7Q%3A1420660393029
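To make the post body easier to read, here is a minimal sketch that splits it into its fields and percent-decodes each value. It only uses Python's standard library; the variable names are mine, and the body string is the one captured above.

```python
from urllib.parse import parse_qs

# POST body copied verbatim from the captured request above.
body = (
    "start=0&num=0&numChildren=0&pagTok=GAEiAggU%3AS%3AANO1ljLtUJw"
    "&ipf=1&xhr=1&token=bH2MlNeViIRJA8dT-zhaKrfNH7Q%3A1420660393029"
)

# parse_qs percent-decodes the values and returns a list per key;
# every key appears once here, so keep only the first value.
fields = {key: values[0] for key, values in parse_qs(body).items()}

for name, value in fields.items():
    print(f"{name} = {value}")
```

Once decoded, the pagTok value contains plain ":S:" where the wire format has "%3AS%3A".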
Now that we know what the request looks like, the next step is to track more requests and try to isolate some of the logic behind how the token is formed.
Here are 3 request tokens I could find:
"GAEiAggU%3AS%3AANO1ljLtUJw", "GAEiAggo%3AS%3AANO1ljIeRQQ", "GAEiAgg8%3AS%3AANO1ljIM1CI"
Finding the Patterns
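One thing our brain is really good at is finding patterns. Here's what mine found about how the tokens are formed:
1 - Starts with: "GAEiA"
2 - Followed by: three characters that change from page to page ("ggU", "ggo" and "gg8" in the samples above)
3 - Followed by: "%3AS%3A" (the URL-encoded form of ":S:")
4 - Followed by: 11 characters that change from page to page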
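The pattern can be checked against the three sample tokens with a short script. This is a sketch under assumptions: counting characters in the samples gives three characters between "GAEiA" and the encoded ":S:", the character classes in the regex are my guess at the alphabet, and treating the part before ":S:" as Base64 is an observation of mine rather than anything Google documents.

```python
import base64
import re
from urllib.parse import unquote

# The three tokens captured from the paging requests.
SAMPLES = [
    "GAEiAggU%3AS%3AANO1ljLtUJw",
    "GAEiAggo%3AS%3AANO1ljIeRQQ",
    "GAEiAgg8%3AS%3AANO1ljIM1CI",
]

# Fixed prefix, 3 varying characters, encoded ":S:", 11 varying characters.
# The character classes are assumptions based on the samples only.
TOKEN_SHAPE = re.compile(r"GAEiA([A-Za-z0-9+/]{3})%3AS%3A([A-Za-z0-9_-]{11})")


def split_token(encoded_token):
    """Return (varying_part, suffix) if the token fits the pattern, else None."""
    match = TOKEN_SHAPE.fullmatch(encoded_token)
    return match.groups() if match else None


def decode_prefix(encoded_token):
    """Base64-decode everything before ':S:' (assumes that part is Base64)."""
    prefix = unquote(encoded_token).split(":S:")[0]
    return base64.b64decode(prefix)


for token in SAMPLES:
    print(token, split_token(token), decode_prefix(token).hex())
```

For these three samples the last decoded byte is 20, 40 and 60, which suggests the varying part may encode a result offset rather than random data. That is an inference from three samples only, so scraping the token from the page remains the safer route.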
Browser Javascript tricks x Manual HTTP Requests
Making the same request in a browser will, most of the time, not yield the same results as issuing the HTTP request manually from code. Why? Because of JavaScript.
Google is a heavy JS user, so it will use its own tricks to try to fool you.
If you look at the HTML, you will see no token that matches the pattern described above, instead, you will find something like:
u0026c\\u003dapps\42,\42GAEiAghQ:S:ANO1ljLxWBY\42,\0420\42,\0420\42,\0420\42]\n
If you look carefully, you will see that your token is within this "random string". All you have to do is replace ":S:" with "%3AS%3A", that is, URL-encode the two colons.
Regular Expressions For the Win
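If you apply a regex to the page, you will be able to find the token, and then manually replace the :S: string with %3AS%3A.
Here's the one I ended up using (built with an online regex builder).
Generated regular expression:
/GAEi+.+:S:.{11}\42/
Textual meaning of regular expression: the literal "GAE", one or more "i", one or more of any character, the literal ":S:", exactly 11 characters of any kind, and then the "\42" that closes the token in the page source.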
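As an illustration of the scraping step, here is a minimal Python sketch run against the HTML fragment shown earlier. The regex is a variation of mine, not the exact expression above: it matches the `\42` sequences literally as they appear in the page source and uses a capture group so only the token is returned. The function name `extract_pag_tok` is hypothetical.

```python
import re
from urllib.parse import quote

# Fragment of the page source quoted earlier, kept verbatim in a raw string.
html_fragment = (
    r"u0026c\\u003dapps\42,\42GAEiAghQ:S:ANO1ljLxWBY\42,"
    r"\0420\42,\0420\42,\0420\42]\n"
)

# Literal "\42", then a token starting with "GAEi", then ":S:",
# then 11 characters, then a closing literal "\42".
TOKEN_RE = re.compile(r"\\42(GAEi[^\\:]+:S:[^\\]{11})\\42")


def extract_pag_tok(html):
    """Return the raw token found in the page source, or None if absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


raw_token = extract_pag_tok(html_fragment)
# safe="" forces ":" to be percent-encoded as "%3A".
encoded_token = quote(raw_token, safe="")
print(raw_token)
print(encoded_token)
```

Letting `quote` do the encoding avoids typing the "%3AS%3A" replacement by hand. If the request is built with an HTTP library that encodes form data itself, the raw token should be passed instead, so that it is not encoded twice.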
TL;DR
The token is present in the HTML, but it is "masked" by Google, which "unmasks" it using JavaScript (which you can only run with a real browser engine, for example one driven by Selenium).
To fetch the pagTok for the next page, read the current page's HTML, scrape the token out of it (logic above), use it in the next request, and repeat.
I hope it helps. Sorry about the wall of text; I wanted to be as clear as possible.
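The read, scrape, request, repeat loop can be sketched as follows. To keep the example free of network code, the HTTP call is passed in as `fetch_page`, a hypothetical callable that takes the previous token (or None for the first page) and returns the page HTML; how it builds the POST request is left to the caller. The regex is the same assumed variation that matches `\42` literally.

```python
import re
from typing import Callable, Iterator, Optional

# Assumed token shape: literal "\42", token with ":S:" inside, literal "\42".
TOKEN_RE = re.compile(r"\\42(GAEi[^\\:]+:S:[^\\]{11})\\42")


def next_token(html: str) -> Optional[str]:
    """Scrape the token for the following page out of the current page."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


def crawl(
    fetch_page: Callable[[Optional[str]], str], max_pages: int = 10
) -> Iterator[str]:
    """Yield page HTML, feeding each page's token into the next request."""
    token = None  # the first page is requested without a token
    for _ in range(max_pages):
        html = fetch_page(token)
        yield html
        token = next_token(html)
        if token is None:
            break  # no token in the page, so there is nothing left to request
```

A real `fetch_page` would send the POST request shown earlier with pagTok set to the token. The `max_pages` cap is only a safety limit for the sketch.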