Clone/download specific files from a GitHub repository

There is a Git repository on GitHub called platform_frameworks_base containing part of the Android source code.
I wrote an application that relies on all the .aidl files from that project, so it downloads them all on first start.
Until now I did that by downloading the file Android.bp from the project root, extracting all file paths ending in .aidl from that file and then explicitly downloading them one by one.

For example if I found this file path:


I knew I could download it like this:

wget https://raw.githubusercontent.com/aosp-mirror/platform_frameworks_base/android-10.0.0_r47/media/java/android/media/IAudioService.aidl
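Put together, the approach described above can be sketched roughly as follows. The helper names `aidl_entries` and `raw_url` are made up for illustration, and the sketch assumes Android.bp lists each source file as a double-quoted string on its own line.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read Android.bp text on stdin and print every
# double-quoted entry that ends in .aidl (same grep/cut idea as in the question).
aidl_entries() {
    grep '\.aidl"' | cut -d'"' -f2
}

# Hypothetical helper: build the raw download URL for one path at a given tag.
raw_url() {
    local tag=$1 path=$2
    printf 'https://raw.githubusercontent.com/aosp-mirror/platform_frameworks_base/%s/%s\n' \
        "$tag" "$path"
}

# Possible usage (performs network requests, so it is left commented out):
#   tag=android-10.0.0_r47
#   wget -q -O - "$(raw_url "$tag" Android.bp)" | aidl_entries |
#       while IFS= read -r path; do wget "$(raw_url "$tag" "$path")"; done
```

This only works while every entry is a literal path. A wildcard entry would be turned into a URL that does not exist.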

This works fine until Android 10 (git tag: android-10.0.0_r47).
Starting with Android 11 (e.g. git tag: android-11.0.0_r33), the file paths use wildcards instead of complete paths. See this Android.bp.

It now just contains wildcard/glob file paths like:



My current "solution":

  1. Clone the repo (only the last commit of the branch we care about):

    git clone --depth=1 -b android-11.0.0_r33 https://github.com/aosp-mirror/platform_frameworks_base.git

  2. Extract the wildcard/glob paths from Android.bp.

    cat Android.bp | grep '\.aidl"' | cut -d'"' -f2

  3. Find all the files matching the wildcard/glob paths.

    e.g. shopt -s globstar && echo media/java/**/*.aidl
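Steps 2 and 3 can be combined into one small helper. This is only a sketch: `expand_globs` is a made-up name, it needs bash 4 or newer for `globstar`, and it assumes the patterns contain no whitespace.

```shell
#!/usr/bin/env bash

# Hypothetical helper: expand_globs ROOT
# Reads glob patterns (one per line) on stdin and prints every regular file
# below ROOT that matches, relative to ROOT.
expand_globs() {
    local root=$1
    (
        cd "$root" || exit 1
        # globstar makes ** match zero or more directories;
        # nullglob makes a pattern without matches expand to nothing.
        shopt -s globstar nullglob
        while IFS= read -r pattern; do
            # Deliberately unquoted so that the shell expands the glob.
            for f in $pattern; do
                if [ -f "$f" ]; then
                    printf '%s\n' "$f"
                fi
            done
        done
    )
}

# Possible usage inside the cloned repository:
#   grep '\.aidl"' Android.bp | cut -d'"' -f2 | expand_globs .
```

The subshell keeps the `cd` and the `shopt` settings from leaking into the calling script.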

But the download takes way too long because the repository contains over a gigabyte of binary files, even if I clone only the last commit of the branch I care about.

Now my actual question is either:
How can I just download the .aidl files that I actually care about? (Ideally without parsing the HTML of every folder in GitHub.)
How can I download/clone the repository without all the binary files? (probably not possible with git?)


I tried using the GitHub API to recursively go through all directories, but I immediately get an API rate limit exceeded error:


# Recursively go through all directories and collect the paths to all found .aidl files
# (meant to end up in the global g_aidlFile variable; this excerpt only prints them)
GetAidlFilesFromGithub() {
    local l_dirUrl="$1"
    if [ "$l_dirUrl" == "" ]; then
        echo "ERROR: Directory URL not provided in GetAidlFilesFromGithub"
        exit 1
    fi
    echo "l_dirUrl: ${l_dirUrl}"
    # -i includes the response headers so the status code can be checked
    l_rawRes="$(curl -s -i "$l_dirUrl")"
    l_statusCode="$(echo "$l_rawRes" | grep HTTP | head -1 | cut -d' ' -f2)"
    # Drop everything up to the first blank line (the headers)
    l_resBody="$(echo "$l_rawRes" | sed '1,/^\s*$/d')"
    if [[ $l_statusCode == 4* ]] || [[ $l_statusCode == 5* ]]; then
        echo "ERROR: Request failed!"
        echo "Response status: $l_statusCode"
        echo "Response body:"
        echo "$l_resBody"
        exit 1
    fi
    l_currentDirJson="$(echo "$l_resBody")"
    if [ "$l_currentDirJson" == "" ]; then
        echo "ERROR: l_currentDirJson is empty"
        exit 1
    fi
    l_newAidlFiles="$(echo "$l_currentDirJson" | jq -r '.[] | select(.type=="file") | select(.path | endswith(".aidl")) | .path')"
    if [ "$l_newAidlFiles" != "" ]; then
        echo "l_newAidlFiles: ${l_newAidlFiles}"
    fi

    # -r prints the URLs without JSON quotes so they can be passed to curl
    l_subDirUrls="$(echo "$l_currentDirJson" | jq -r '.[] | select(.type=="dir") | .url')"
    if [ "$l_subDirUrls" != "" ]; then
        echo "$l_subDirUrls" | while IFS= read -r l_subDirUrl ; do
            (GetAidlFilesFromGithub "$l_subDirUrl")
        done
    else
        echo "No subdirs found."
    fi
}

GetAidlFilesFromGithub "https://api.github.com/repos/aosp-mirror/platform_frameworks_base/contents?ref=android-11.0.0_r33"

From what I understand all my users would have to create a GitHub account and create an OAUTH secret to raise the limit. That's definitely not an option for me. I want my application to be easy to use.
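For context, the contents endpoint needs one request per directory, while unauthenticated API requests are limited to 60 per hour per IP address, so a recursive walk of a repository this size exhausts the quota almost at once. The remaining quota is reported in the `x-ratelimit-remaining` response header. A small sketch for reading it from `curl -i` output follows; the helper name `rate_limit_remaining` is made up.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read a raw HTTP response (headers first, as printed by
# "curl -s -i") on stdin and print the value of the x-ratelimit-remaining header.
rate_limit_remaining() {
    # Strip carriage returns, then match the header name case-insensitively.
    tr -d '\r' | grep -i '^x-ratelimit-remaining:' | head -1 | cut -d' ' -f2
}

# Possible usage (performs a network request, so it is left commented out):
#   curl -s -i "https://api.github.com/repos/aosp-mirror/platform_frameworks_base/contents?ref=android-11.0.0_r33" |
#       rate_limit_remaining
```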

4 Answers

Since the repo's on GitHub, which supports filters, easiest is probably to use its filter support.

git clone --no-checkout --depth=1 --filter=blob:none \
        -b android-11.0.0_r33 https://github.com/aosp-mirror/platform_frameworks_base.git
cd platform_frameworks_base
git reset -q -- \*.aidl
git checkout-index -a

which could probably be finessed quite a bit to get the files sent in a single pack instead of the one-at-a-time-fetch that produces.

For instance, instead of blob:none say blob:limit=16384, that gets most of them up front.

To do this in your own code, without relying on a Git install, you'd need to implement the git protocol. Here's the online intro with pointers to the actual Git docs. It's not hard: you send text lines back and forth until the server spits out the gobsmacking lot of data you wanted, then you pick through it. You don't need to use https, GitHub supports the plain git protocol. Try running that clone command with GIT_TRACE=1 GIT_TRACE_PACKET=1.
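To give a feel for those "text lines": Git frames each one as a pkt-line, meaning four hex digits with the total length (including those four digits) followed by the payload, and `0000` as a flush marker. Below is a minimal sketch of the framing only, not of a full client. The helper name `pkt_line` is made up, and the length calculation assumes an ASCII payload.

```shell
#!/usr/bin/env bash

# Hypothetical helper: frame one payload as a Git pkt-line.
pkt_line() {
    local payload=$1
    # The length prefix is 4 hex digits and counts the 4-byte prefix itself.
    # ${#payload} counts characters, so this assumes an ASCII payload.
    printf '%04x%s' "$(( ${#payload} + 4 ))" "$payload"
}

# Example: a "want" line for a 40-character object id plus a trailing newline
# is 46 bytes of payload, so the prefix is 0032 (50 in hex).
#   pkt_line $'want 0123456789abcdef0123456789abcdef01234567\n'
```

The packet trace output shows the same payloads, which makes it a handy reference while implementing this.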

Not sure if this is what you wanted:

#!/usr/bin/env bash

# Fetch the complete file tree of one branch/tag in a single API request
get_github_file_list() {
    local user=$1 repo=$2 branch=$3
    curl -s "https://api.github.com/repos/$user/$repo/git/trees/$branch?recursive=1"
}

get_github_file_list aosp-mirror platform_frameworks_base android-11.0.0_r33 |\
    jq -r '.tree|map(.path|select(test("\\.aidl$")))[]'
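One caveat with the recursive tree request: for very large trees the response carries `"truncated": true`, and the list is then incomplete. Whether this repository hits that limit is not something I checked, so a guard like the following may be worth adding. It is a sketch; the helper name `tree_is_truncated` is made up, and it uses a plain grep rather than a real JSON parser.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed (exit 0) if the tree JSON on stdin
# contains "truncated": true, fail otherwise.
tree_is_truncated() {
    grep -Eq '"truncated":[[:space:]]*true'
}

# Possible usage with the function from the answer above:
#   json=$(get_github_file_list aosp-mirror platform_frameworks_base android-11.0.0_r33)
#   if printf '%s\n' "$json" | tree_is_truncated; then
#       echo "warning: tree listing is truncated, some .aidl files may be missing" >&2
#   fi
```

With jq already in use, `jq -e '.truncated'` would be a sturdier way to make the same check.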

You could use the GitHub API code search endpoint to get the paths, but then use your wget raw.githubusercontent method to download them:

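# apiurl1 and apiurl2 hold the code search query URLs (their definitions are not shown here)
for apiurl in "$apiurl1" "$apiurl2"; do
  page=1
  # keep fetching pages until one contains no more .aidl paths
  while paths=$(
    curl -s "$apiurl&page=$page" | grep '"path": ' | grep -o '[^"]\+\.aidl'
  ); do
    # do your stuff with the $paths
    page=$(($page + 1))
  done
done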

Unfortunately, the GitHub API code search endpoint only searches the default branch (in this case, master), whereas you want the android-10.0.0_r47 tag. There could be files in android-10.0.0_r47 but not in master, and this code won't find and download these.

An alternative solution is to do a very minimal clone of each tag you're interested in, and then use git ls-tree to get the paths of each tag, e.g.,

for tag in 'android-10.0.0_r47' 'android-11.0.0_r33'; do
  git clone --branch "$tag" --depth=1 --bare --no-checkout \
    --filter=blob:limit=0 git@github.com:aosp-mirror/platform_frameworks_base.git
  # only a 1.8M download
  mv platform_frameworks_base.git "$tag"
  cd "$tag"
  paths=$(git ls-tree -r HEAD --name-only | grep '\.aidl$')
  # do your stuff with the paths
  cd ..
done

If this is for your own use, I probably wouldn't use either of these methods. I would just clone the entire huge repo once and then work with it locally, e.g.,

if [ -e platform_frameworks_base ]; then
  cd platform_frameworks_base
  git pull
else
  git clone git@github.com:aosp-mirror/platform_frameworks_base.git
  cd platform_frameworks_base
fi
tags=$(git tag | grep '^android')
for tag in $tags; do
  git checkout $tag
  paths=$(git ls-tree -r HEAD --name-only | grep '\.aidl$')
  # do your stuff with the paths
done

Given the circumstances, I would maintain a text file that is automatically updated with the latest repo file tree before each commit.

The script should be easy to write and be fast to run since all this is happening locally. It can be called manually by introducing a new work process or be integrated into your test/CI automation process.

Then you know what to do in your end-user application: download this file first, filter it with the patterns from Android.bp, then fetch the files you want through the GitHub raw content links.
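The filtering step could look roughly like this. It is a sketch under stated assumptions: the helper names `glob_to_regex` and `filter_by_globs` are made up, `**/` is taken to mean "zero or more directories" and `*` to stay within one path segment, and regex metacharacters other than `.` and `*` are not escaped.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert one Android.bp-style glob into an
# extended regular expression (without anchors).
glob_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/[.]/\\./g' \
        -e 's#\*\*/#<GLOBSTAR>#g' \
        -e 's#\*#[^/]*#g' \
        -e 's#<GLOBSTAR>#(.*/)?#g'
}

# Hypothetical helper: filter_by_globs FILE_LIST
# Reads globs (one per line) on stdin and prints every line of FILE_LIST
# that matches at least one of them.
filter_by_globs() {
    local file_list=$1 glob
    while IFS= read -r glob; do
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -E "^$(glob_to_regex "$glob")\$" "$file_list" || true
    done
}

# Possible usage, with filelist.txt being the downloaded file tree:
#   grep '\.aidl"' Android.bp | cut -d'"' -f2 | filter_by_globs filelist.txt | sort -u
```

The `sort -u` in the usage line is there because a path matched by two patterns would otherwise be printed twice.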

