Clone/download specific files from a GitHub repository

There is a Git repository on GitHub called platform_frameworks_base containing part of the Android source code.
I wrote an application that relies on all the .aidl files from that project, so it downloads them all on first start.
Until now I did that by downloading the file Android.bp from the project root, extracting all file paths ending in .aidl from it, and then downloading those files one by one.

For example if I found this file path:

media/java/android/media/IAudioService.aidl

I knew I could download it like this:

wget https://raw.githubusercontent.com/aosp-mirror/platform_frameworks_base/android-10.0.0_r47/media/java/android/media/IAudioService.aidl

This works fine up to Android 10 (git tag: android-10.0.0_r47).
Starting with Android 11 (e.g. git tag: android-11.0.0_r33), the file paths use wildcards instead of complete paths. See this Android.bp.

It now just contains wildcard/glob file paths like:

media/java/**/*.aidl
location/java/**/*.aidl

etc...

My current "solution":

  1. Clone the repo (only the last commit of the branch we care about):

    git clone --depth=1 -b android-11.0.0_r33 https://github.com/aosp-mirror/platform_frameworks_base.git

  2. Extract the wildcard/glob paths from Android.bp.

    cat Android.bp | grep '\.aidl"' | cut -d'"' -f2

  3. Find all the files matching the wildcard/glob paths.

    e.g. shopt -s globstar && echo media/java/**/*.aidl
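
Steps 2 and 3 might be combined roughly as follows. This is only a sketch: list_aidl_files is a made-up helper name, and it assumes bash 4 or later (for globstar) and an already cloned repository directory that contains Android.bp.

```shell
#!/usr/bin/env bash
# Sketch: print every file matched by the .aidl glob patterns in Android.bp.
# list_aidl_files is a hypothetical helper, not part of any existing tool.
list_aidl_files() {
    local repo_dir=$1
    (
        cd "$repo_dir" || exit 1
        # globstar lets ** match nested directories; nullglob drops patterns without matches
        shopt -s globstar nullglob
        # Same extraction as step 2: the quoted strings ending in .aidl
        grep '\.aidl"' Android.bp | cut -d'"' -f2 | while IFS= read -r pattern; do
            # $pattern is left unquoted on purpose so that bash expands the glob
            for f in $pattern; do
                if [ -f "$f" ]; then
                    printf '%s\n' "$f"
                fi
            done
        done
    )
}
```

Usage would be something like list_aidl_files platform_frameworks_base, which prints one path per line relative to the repository root.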

The download takes way too long, though, because the repository contains over a gigabyte of binary files, even if I clone only the last commit of the branch I care about.

Now my actual question is either:
How can I just download the .aidl files that I actually care about? (Ideally without parsing the HTML of every folder in GitHub.)
Or
How can I download/clone the repository without all the binary files? (probably not possible with git?)

Edit:

I tried using the GitHub API to recursively go through all directories, but I immediately get an API rate limit exceeded error:

g_aidlFiles=""

# Recursively go through all directories and store the paths of all found .aidl files in the global g_aidlFiles variable
GetAidlFilesFromGithub() {
    l_dirUrl="${1-}"
    if [ "$l_dirUrl" == "" ]; then
        echo "ERROR: Directory URL not provided in GetAidlFilesFromGithub"
        exit 1
    fi
    
    echo "l_dirUrl: ${l_dirUrl}"
    
    l_rawRes="$(curl -s -i "$l_dirUrl")"
    l_statusCode="$(echo "$l_rawRes" | grep HTTP | head -1 | cut -d' ' -f2)"
    l_resBody="$(echo "$l_rawRes" | sed '1,/^\s*$/d')"
    if [[ $l_statusCode == 4* ]] || [[ $l_statusCode == 5* ]]; then
        echo "ERROR: Request failed!"
        echo "Response status: $l_statusCode"
        echo "Response body:"
        echo "$l_resBody"
        exit 1
    fi
    
    l_currentDirJson="$(echo "$l_resBody")"
    if [ "$l_currentDirJson" == "" ]; then
        echo "ERROR: l_currentDirJson is empty"
        exit 1
    fi
    
    l_newAidlFiles="$(echo "$l_currentDirJson" | jq -r '.[] | select(.type=="file") | select(.path | endswith(".aidl")) | .path')"
    
    if [ "$l_newAidlFiles" != "" ]; then
        echo "l_newAidlFiles: ${l_newAidlFiles}"
        g_aidlFiles="${g_aidlFiles}"$'\n'"${l_newAidlFiles}"
    fi

    # -r makes jq print raw URLs; without it they are wrapped in quotes and curl cannot fetch them
    l_subDirUrls="$(echo "$l_currentDirJson" | jq -r '.[] | select(.type=="dir") | .url')"
    if [ "$l_subDirUrls" != "" ]; then
        # Read from a here-string and call without a subshell so g_aidlFiles keeps its value
        while IFS= read -r l_subDirUrl ; do
            GetAidlFilesFromGithub "$l_subDirUrl"
        done <<< "$l_subDirUrls"
    else
        echo "No subdirs found."
    fi
}

GetAidlFilesFromGithub "https://api.github.com/repos/aosp-mirror/platform_frameworks_base/contents?ref=android-11.0.0_r33"

From what I understand all my users would have to create a GitHub account and create an OAUTH secret to raise the limit. That's definitely not an option for me. I want my application to be easy to use.

Forivin asked Mar 12 '21 13:03


4 Answers

Since the repo is on GitHub, which supports partial-clone filters, the easiest option is probably to use them.

git clone --no-checkout --depth=1 --filter=blob:none \
        https://github.com/aosp-mirror/platform_frameworks_base
cd platform_frameworks_base
git reset -q -- \*.aidl
git checkout-index -a

This could probably be finessed quite a bit to get the files sent in a single pack instead of the one-at-a-time fetches it produces.

For instance, say blob:limit=16384 instead of blob:none; that gets most of them up front.
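
As a rough sketch, the commands above could be wrapped so that the tag and destination are parameters. fetch_aidl_files is a hypothetical name, the --branch argument is an addition so that a specific tag such as android-11.0.0_r33 is fetched instead of the default branch, and the 16384 limit is just the value suggested above.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_aidl_files is a hypothetical wrapper around the commands above.
# Arguments: repository URL, tag or branch name, destination directory.
fetch_aidl_files() {
    local url=$1 ref=$2 dest=$3
    # Shallow, filtered clone of just the requested ref; nothing is checked out yet.
    git clone -q --no-checkout --depth=1 --branch "$ref" \
        --filter=blob:limit=16384 "$url" "$dest" || return 1
    # Put only the .aidl paths into the index, then write them to the work tree.
    # Blobs that the filter left out are fetched on demand at this point.
    git -C "$dest" reset -q -- '*.aidl' &&
        git -C "$dest" checkout-index -a
}
```

It would be called like fetch_aidl_files https://github.com/aosp-mirror/platform_frameworks_base android-11.0.0_r33 aidl-src, after which aidl-src should contain only the .aidl files (plus the .git directory).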

To do this in your own code, without relying on a Git install, you'd need to implement the Git protocol. Here's the online intro with pointers to the actual Git docs. It's not hard: you send text lines back and forth until the server spits out the gobsmacking lot of data you wanted, then you pick through it. Note that GitHub switched off the unencrypted git:// protocol in 2022, so this has to go over https. Try running that clone command with GIT_TRACE=1 GIT_PACKET_TRACE=1 to watch the exchange.

jthill answered Oct 22 '22 02:10


Not sure if this is what you wanted:

#!/usr/bin/env bash
  
get_github_file_list(){
    local user=$1 repo=$2 branch=$3
    curl -s "https://api.github.com/repos/$user/$repo/git/trees/$branch?recursive=1"
}

get_github_file_list aosp-mirror platform_frameworks_base android-11.0.0_r33 |\
    jq -r '.tree|map(.path|select(test("\\.aidl$")))[]'
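
To go from that path list to actual downloads, the paths could be turned into raw.githubusercontent.com URLs like the one in the question. The following is a sketch under assumptions: to_raw_urls is a made-up helper name, and the commented pipeline has not been tuned for this repository. The listing itself is a single API request, but for very large trees the API marks its response with "truncated": true, in which case the list would be incomplete.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn repo-relative paths (one per line on stdin)
# into raw.githubusercontent.com URLs for the given user, repo and ref.
to_raw_urls() {
    local user=$1 repo=$2 ref=$3 path
    while IFS= read -r path; do
        printf 'https://raw.githubusercontent.com/%s/%s/%s/%s\n' \
            "$user" "$repo" "$ref" "$path"
    done
}

# Example pipeline (needs network access, curl, jq and wget):
#   get_github_file_list aosp-mirror platform_frameworks_base android-11.0.0_r33 |
#       jq -r '.tree[].path | select(endswith(".aidl"))' |
#       to_raw_urls aosp-mirror platform_frameworks_base android-11.0.0_r33 |
#       xargs -n 1 wget -x -nH --cut-dirs=3
```

In the commented pipeline, wget -x -nH --cut-dirs=3 is meant to recreate the directory layout locally while dropping the host, user, repo and tag components of each URL.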

Philippe answered Oct 22 '22 02:10


You could use the GitHub API code search endpoint to get the paths, and then use your wget raw.githubusercontent method to download them:

apiurlbase='https://api.github.com/search/code?per_page=100&q=repo:aosp-mirror/platform_frameworks_base+extension:aidl'
dlurlbase='https://raw.githubusercontent.com/aosp-mirror/platform_frameworks_base/android-10.0.0_r47/'
apiurl1="$apiurlbase+path:/media/java/"
apiurl2="$apiurlbase+path:/location/java/"
for apiurl in "$apiurl1" "$apiurl2"; do
  page=1
  while paths=$(
    curl -s "$apiurl&page=$page" | grep '"path": ' | grep -o '[^"]\+\.aidl'
  ); do
    # do your stuff with the $paths
    page=$(($page + 1))
  done
done

Unfortunately, the GitHub API code search endpoint only searches the default branch (in this case, master), whereas you want the android-10.0.0_r47 tag. There could be files in android-10.0.0_r47 but not in master, and this code won't find and download these.

An alternative solution is to do a very minimal clone of each tag you're interested in, and then use git ls-tree to get the paths of each tag, e.g.,

for tag in 'android-10.0.0_r47' 'android-11.0.0_r33'; do
  git clone --branch "$tag" --depth=1 --bare --no-checkout \
    --filter=blob:limit=0 git@github.com:aosp-mirror/platform_frameworks_base.git
  # only a 1.8M download
  mv platform_frameworks_base.git "$tag"
  cd "$tag"
  paths=$(git ls-tree -r HEAD --name-only | grep '\.aidl$')
  # do your stuff with the paths
  cd ..
done
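
Inside such a bare clone, the "do your stuff" step could write the files out without any checkout. The sketch below is an assumption about how one might do it: export_aidl_from_bare is a hypothetical name, and each git cat-file call fetches the missing blob on demand when the clone was made with a blob filter, so it is one round trip per file.

```shell
#!/usr/bin/env bash
# Sketch: copy every .aidl file at HEAD out of a bare clone into out_dir.
# export_aidl_from_bare is a hypothetical helper name.
export_aidl_from_bare() {
    local git_dir=$1 out_dir=$2 path
    git --git-dir="$git_dir" ls-tree -r HEAD --name-only | grep '\.aidl$' |
        while IFS= read -r path; do
            mkdir -p "$out_dir/$(dirname "$path")"
            # In a filtered clone this triggers a lazy fetch of the blob
            git --git-dir="$git_dir" cat-file blob "HEAD:$path" > "$out_dir/$path"
        done
}
```

For the loop above, that would be a call like export_aidl_from_bare "$tag" "aidl-$tag" made from the parent directory.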

If this is for your own use, I probably wouldn't use either of these methods. I would just clone the entire huge repo once and then work with it locally, e.g.,

if [ -e platform_frameworks_base ]; then
  cd platform_frameworks_base
  git pull
else
  git clone git@github.com:aosp-mirror/platform_frameworks_base.git
  cd platform_frameworks_base
fi
tags=$(git tag | grep '^android')
for tag in $tags; do
  git checkout "$tag"
  paths=$(git ls-tree -r HEAD --name-only | grep '\.aidl$')
  # do your stuff with the paths
done

webb answered Oct 22 '22 01:10


Given the circumstances, I would maintain a text file that is automatically updated with the latest repo file tree before each commit.

The script should be easy to write and fast to run, since all of this happens locally. It can be called manually as a new step in your workflow or be integrated into your test/CI automation.

Your end-user application then downloads this file first, filters it against the patterns in Android.bp, and fetches the matching files through GitHub raw content links.
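
The filtering step could look roughly like this. It is a sketch under stated assumptions: filter_paths_by_globs is a hypothetical name, the patterns file holds one Android.bp glob per line, and bash pattern matching only approximates the build system's globs (here * also matches across slashes, so a pattern without ** could match more than intended).

```shell
#!/usr/bin/env bash
# Sketch: print the paths from stdin that match any glob in patterns_file.
# filter_paths_by_globs is a hypothetical helper; it needs bash 4+ for mapfile.
filter_paths_by_globs() {
    local patterns_file=$1 path pattern
    local -a patterns
    mapfile -t patterns < "$patterns_file"
    while IFS= read -r path; do
        for pattern in "${patterns[@]}"; do
            # Unquoted right-hand side = glob match. The second test drops "**/"
            # so that ** can also stand for zero directories.
            if [[ $path == $pattern || $path == ${pattern//'**/'/} ]]; then
                printf '%s\n' "$path"
                break
            fi
        done
    done
}
```

It would be fed the downloaded file tree on stdin, for example filter_paths_by_globs patterns.txt < filetree.txt, where both file names are placeholders.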

alex answered Oct 22 '22 02:10