Git fetch single file from remote repository programmatically

Tags: git, protocols

I'll say up front that this question is similar in nature to this. There's one key difference that makes this unique: I want to use the raw git protocol (see here and here if you're unfamiliar with the basic pack network protocol).

I'm writing an application using Scala and JGit that will connect to an anonymous git repository. I want to request a single blob (think "/path/to/file.txt" @ "refs/heads/branch1"). Ultimately my goal is to programmatically retrieve a single file out of a remote repository. Seems like a pretty useful thing to be able to do.

Anywho, I've been delving into the internals of this protocol. It appears that the basic version of this is "I want these object(s), I have these object(s)" -- and bam, there's a packfile with everything you don't have. The core of my question is this: how do I ask git-upload-pack for a single object in a non-recursive manner? I'm ok with downloading a single commit object, then asking for the tree, then a subtree, then another subtree, and then finally the blob itself. Speed isn't too important here; mainly I'm trying to save on bandwidth. But it seems that there's simply no way to tell git-upload-pack, "please only give me the one object I asked for".

Yes, there's the "have" list, which will basically exclude objects from coming down. However, that requires a priori knowledge of the contents of the repository (I don't have a local repo, remember). I could generate a list of all possible sha1s and send all of them except for the one I want, but that's beyond ridiculous (time consuming, bandwidth consuming, and a crime against programmers everywhere).

Another possible solution I've been delving into is using git-upload-archive on the remote side instead, although I admit I haven't spent much time looking into it yet.

I'm more than willing to rewrite JGit if it comes to that, so please don't read this as "how do I make JGit do...". I just want to know if the protocol itself is even capable of this. I feel like there's some wonderfully clever way to abuse the protocol to achieve what I want. Any thoughts?

asked Jan 18 '13 19:01

Chris Eberle

1 Answer

Answering my own question. I found an acceptable (although barely documented) answer. I had to dig through a LOT of C code to figure this out.

First of all, the above requirements can't be achieved using git-upload-pack because that's simply not what the program was designed to do. The correct answer, as I suspected, is git-upload-archive. Sadly the protocol is hardly documented at all. So here are my notes on it in case anyone else has similar requirements.

Basically what I'm trying to simulate here (in scala) is the following command:

git archive --format=tar --remote=ssh://[email protected]/cornballer.git \
  master plans/documents/cornballer-blueprint.pdf | tar -x

Except in software, hopefully using JGit. Sadly JGit doesn't (yet) support git archive commands. So here's a very high-level overview of how to add support (I may fork JGit and add this at a later time).

Let's look at the protocol (from Documentation/technical/pack-protocol.txt):

git-proto-request = request-command SP pathname NUL [ host-parameter NUL ]
request-command   = "git-upload-pack" / "git-receive-pack" /
                    "git-upload-archive"   ; case sensitive
pathname          = *( %x01-ff ) ; exclude NUL
host-parameter    = "host=" hostname [ ":" port ]

So part one of the protocol goes something like this:

Establish a transport with the remote (either ssh and then run git-upload-archive or use the anonymous git protocol)
Send git-upload-archive /cornballer.git\0host=ssh.mycompany.com\0 (as a packet line)

At this point the connection is established. Git may return an error if the command isn't supported or if there was any kind of problem. I haven't yet figured out how to check for this.
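For the anonymous git protocol case, the request line from step two could be framed as below. This is a minimal Python sketch, not JGit code: the helper names pkt_line and proto_request are made up for illustration, and the framing rule (four hex digits giving the total length, counting the prefix itself) is the standard pkt-line format.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then the payload."""
    length = len(payload) + 4  # the length prefix counts its own 4 bytes
    if length > 65520:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def proto_request(command: str, pathname: str, host: str) -> bytes:
    """Build the initial git-proto-request (hypothetical helper, git:// transport only)."""
    body = f"{command} {pathname}\0host={host}\0".encode("utf-8")
    return pkt_line(body)


# Values taken from the example above.
request = proto_request("git-upload-archive", "/cornballer.git", "ssh.mycompany.com")
print(request)
```

Over ssh this first packet is not needed, since the remote command is started directly.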

Next comes the undocumented part. We basically send command line arguments for git-archive over the wire. They're exactly the same as the git-archive command with one exception: they are all prefixed with argument[SPACE]. Each argument is written (at least in the reference implementation) as a separate packet line. So for the above example:

Send argument --format=tar (as a packet line)
Send argument master (as a packet line)
Send argument plans/documents/cornballer-blueprint.pdf (as a packet line)
Send a flush packet (0000)
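The four steps above might be assembled like this. Again a rough Python sketch with made-up helper names (pkt_line, archive_request). One assumption not stated in the notes above: each argument line ends with a newline, which is the usual pkt-line convention for text lines; receivers are expected to accept the line with or without it.

```python
FLUSH_PKT = b"0000"  # a flush packet is the literal length 0000 with no payload


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then the payload."""
    return b"%04x" % (len(payload) + 4) + payload


def archive_request(args: list[str]) -> bytes:
    """Encode git-archive arguments as 'argument <arg>' packets followed by a flush."""
    packets = [pkt_line(b"argument " + arg.encode("utf-8") + b"\n") for arg in args]
    return b"".join(packets) + FLUSH_PKT


payload = archive_request([
    "--format=tar",
    "master",
    "plans/documents/cornballer-blueprint.pdf",
])
print(payload)
```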

At this point we've given the remote git-archive process the entire command. Now we read the response. We read one packet line back from the server, which will be one of the following responses:

ACK (meaning success -- ready to send the archive)
NACK [message] -- some kind of error, only found one instance of its use -- "unable to spawn subprocess"
ERR [message] -- an error occurred

If an ACK is sent, it will be followed by a flush packet (0000) and then the raw tar data. At this point you repeatedly read packet lines coming in on sideband #1 (the main data channel). When you reach a flush packet, you stop reading. Pretty simple.
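Reading the response could look roughly like the following Python sketch. The function names are hypothetical, and the stream is assumed to be a blocking binary file object (for a socket, something like sock.makefile("rb")). The first byte of each data packet is the sideband number; treating band 2 as progress text and band 3 as a fatal error follows git's general sideband convention rather than anything specific to the notes above.

```python
import io


def read_pkt(stream):
    """Read one pkt-line and return its payload, or None for a flush packet."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended inside a pkt-line header")
    length = int(header, 16)
    if length == 0:
        return None  # flush packet (0000)
    if length < 4:
        raise ValueError(f"invalid pkt-line length: {length}")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("stream ended inside a pkt-line payload")
    return payload


def read_archive(stream) -> bytes:
    """Expect ACK plus flush, then collect sideband #1 data until the next flush."""
    status = read_pkt(stream)
    if status is None or status.rstrip(b"\n") != b"ACK":
        raise RuntimeError(f"remote refused the archive request: {status!r}")
    if read_pkt(stream) is not None:
        raise RuntimeError("expected a flush packet after ACK")

    archive = bytearray()
    while True:
        pkt = read_pkt(stream)
        if pkt is None:
            return bytes(archive)  # flush packet marks the end of the archive
        if not pkt:
            continue  # empty packet, nothing to demultiplex
        band, data = pkt[0], pkt[1:]
        if band == 1:
            archive += data  # main data channel
        elif band == 3:
            raise RuntimeError(data.decode("utf-8", errors="replace"))
        # band 2 carries progress messages; this sketch ignores them
```

A NACK or ERR line ends up in the exception message, so the caller at least sees what the remote said.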

So now you have the remote file, but what if you wanted to do some kind of clever caching? One reason I was so gung-ho on using git-upload-pack is that it would let me record the commit ID and thus cache it locally and only refresh as needed. A tar file doesn't tell us that info, right? Wrong!

From the man page of git-archive:

Additionally the commit ID is stored in a global extended pax header if the tar format is used; it can be extracted using git get-tar-commit-id. In ZIP files it is stored as a file comment.

Well that's great news! That's literally everything I wanted. In case you're wondering what the header looks like, here's a sample (no I'm not going to dissect pax headers):

pax_global_header00006660000000000000000000000064121002672560014513gustar00rootroot0000000000000052 comment=326756f834865880c9832b64238e7665632e9b67
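If you'd rather not parse that by hand, a tar library that understands pax headers can pull the comment out. A small Python sketch using the standard tarfile module, whose TarFile.pax_headers dictionary holds the global pax records; the function name tar_commit_id is made up, and tar_bytes is assumed to be the complete, uncompressed archive received from the remote.

```python
import io
import tarfile


def tar_commit_id(tar_bytes: bytes):
    """Return the commit ID stored in the global pax header, or None if absent."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        # git archive stores the commit ID under the "comment" key
        return tar.pax_headers.get("comment")
```

The result can be compared against a cached commit ID to decide whether the local copy needs refreshing.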

So from my perspective, I simply need to set up a pipeline that runs the above steps and then passes the result through an untar step (programmatically) to get the desired "fetch a single file from git" functionality.
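The untar step can stay entirely in memory. A minimal Python sketch with the standard tarfile module; extract_file is a hypothetical name, and tar_bytes is assumed to be the full uncompressed archive read from the remote.

```python
import io
import tarfile


def extract_file(tar_bytes: bytes, path: str) -> bytes:
    """Return the contents of one regular file from an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        member = tar.getmember(path)  # raises KeyError if the path is absent
        fileobj = tar.extractfile(member)
        if fileobj is None:
            raise ValueError(f"{path} is not a regular file")
        return fileobj.read()
```

For the example above, the path would be plans/documents/cornballer-blueprint.pdf.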

answered Oct 16 '22 11:10

Chris Eberle
