Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Crawling the Google Play store

I want to crawl the Google Play store to download the web pages of all the android application (All the webpages with the following base url: https://play.google.com/store/apps/). I checked the robots.txt file of the play store and it disallows crawling these URLs.

Also, when I browse the Google Play store I can only see top applications up to 3 pages for each of the categories. How can I get the other application pages?

If anyone has tried crawling the Google Play please let me know the following things: a) Were you successful in crawling the play store. If yes, please let me know how you did that. b) How to crawl the hidden application pages not visible in top apps for each of the categories? c) Is there a techniques to download the applications also and not just the webpages?

I already searched around and found the following links:

a) https://code.google.com/p/android-market-api/ 
b) https://code.google.com/p/android-marketplace-crawler/source/checkout 
c) http://mohsin-junaid.blogspot.co.uk/2012/12/how-to-install-android-marketplace.html 
d) http://mohsin-junaid.blogspot.in/2012/12/how-to-download-multiple-android-apks.html

Thanks!

like image 490
Naruto Uzumaki Avatar asked Jun 08 '13 17:06

Naruto Uzumaki


Video Answer


2 Answers

First of all, Google Play's robots.txt does NOT disallow the pages with base "/store/apps".

If you want to crawl Google Play you would need to develop your own web crawler, parse the HTML page and extract the app meta-data you need (e.g. title, descriptions, price, etc). This topic has been covered in this other question. There are libraries helping with that, for instance:

  • Java: https://jsoup.org
  • Python: https://scrapy.org

The harder part is to "find" the app-pages to crawl. You could use 1) the Google Play Sitemap or 2) follow the app-links you find in every page you crawl as explained in the Link Extractor documentation (in case you plan to use Scrapy).

Another option is to use an open-source library based on ProtoBuf to fetch meta-data about an app, here the link to the project: https://code.google.com/archive/p/android-market-api. This library fetches app meta-data from Google Play on behalf of a valid Google account, but also in this case you need a crawler to "find" which apps are available and schedule their meta-data retrieval. This other open-source project can help you with that: https://code.google.com/archive/p/android-marketplace-crawler.

If you don't want to implement all this by yourself, you could use a third-party managed service to access Android apps meta-data through a JSON-based API. For instance, 42matters.com (the company I work for) offers an API for both Android and iOS to retrieve apps' meta-data, here more details:

https://42matters.com/app-market-data

In order to get the Title, Icon, Description, Downloads for an app you can use the "lookup" endpoint as documented here:

https://42matters.com/docs/app-market-data/android/apps/lookup

This is an example of the JSON response for the "Angry Birds Space Premium" app:

{
    "package_name": "com.rovio.angrybirdsspace.premium",
    "title": "Angry Birds Space Premium",
    "description": "Play over 300 interstellar levels across 10 planets...",
    "short_desc": "The #1 mobile game of all time blasts off into space!",
    "rating": 4.3046236038208,
    "category": "Arcade",
    "cat_key": "GAME_ARCADE",
    "cat_keys": [
        "GAME_ARCADE",
        "GAME",
        "FAMILY_EDUCATION",
        "FAMILY"
    ],
    "price": "$1.15",
    "downloads": "1,000,000 - 5,000,000",
    "version": "2.2.1",
    "content_rating": "Everyone",
    "promo_video": "https://www.youtube.com/embed/g6AL9YqRHaI?ps=play&vq=large&rel=0&autohide=1&showinfo=0&autoplay=1",
    "market_update": "2015-07-03T00:00:00+00:00",
    "screenshots": [
        "https://lh3.googleusercontent.com/ZmuBQzIy1G74coPrQ1R7fCeKdJmjTdpJhNrIHBOaFyM0N2EYdUPwZaQjnQUtiUDGmac=h310",
        "https://lh3.googleusercontent.com/Xg2Aq70ZH0SnNhtSKH7xg9jCfisWgmmq3C7xQbx6YMhTVAIRqlRJeH8GYtjxapb_qR4=h310",
        "https://lh3.googleusercontent.com/T4o5-2_UP82sj4fSSegbjrGmslNHlfvtEYuZacXMSOC55-7eyiKySw05lNF1QQGO2FeU=h310",
        "https://lh3.googleusercontent.com/f2ennaLdivFu5cQQaVPKsRcWxB8FS5T4Bkoy3l0iPW9-GDDnTVRhvR5kz6l4m8FL1c8=h310",
        "https://lh3.googleusercontent.com/H-9M03_-O9Df1nHr2-rUdjtk2aeBY3bAxnqSX3m2zh_aV8-K1t0qU1DxLXnK0GrDAw=h310"
    ],
    "created": "2012-03-22T08:24:00+00:00",
    "developer": "Rovio Entertainment Ltd.",
    "number_ratings": 20812,
    "price_currency": "$",
    "icon": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w300",
    "icon_72": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w72",
    "market_url": "https://play.google.com/store/apps/details?id=com.rovio.angrybirdsspace.premium&referrer=utm_source%3D42matters.com%26utm_medium%3Dapi"
}

I hope this helps, otherwise feel free to get in touch with me. I know this topic quite well and can point you in the right direction.

Regards,

Andrea

like image 72
agirardello Avatar answered Sep 24 '22 20:09

agirardello


I have did the job in Python before, what you need is a web auto test lib called selenium, it can execute Javascript code and return the result to Python, with Javascript, you can click the "show more" button by the program itself. And when you get all links for a single category page, you can get some info for the app. The simple demo here. Hope helpful.

like image 27
lqhcpsgbl Avatar answered Sep 24 '22 20:09

lqhcpsgbl