
Git fetch single file from remote repository programmatically

Tags:

git

protocols

I'll say up front that this question is similar in nature to this. There's one key difference that makes this unique: I want to use the raw git protocol (see here and here if you're unfamiliar with the basic pack network protocol).

I'm writing an application using Scala and JGit that will connect to an anonymous git repository. I want to request a single blob (think "/path/to/file.txt" @ "refs/heads/branch1"). Ultimately my goal is to programmatically retrieve a single file out of a remote repository. Seems like a pretty useful thing to be able to do.

Anywho, I've been delving into the internals of this protocol. It appears that the basic version of this is "I want these object(s), I have these object(s)" -- and bam, there's a packfile with everything you don't have. The core of my question is this: how do I ask git-upload-pack for a single object in a non-recursive manner? I'm ok with downloading a single commit object, then asking for the tree, then a subtree, then another subtree, and then finally the blob itself. Speed isn't too important here; mainly I'm trying to save on bandwidth. But it seems that there's simply no way to tell git-upload-pack, "please only give me the one object I asked for".

Yes, there's the "have" list, which will basically exclude objects from coming down; however, that requires a priori knowledge of the contents of a repository (I don't have a local repo, remember). I could generate a list of all possible sha1s and send all of them except for the one I want, but that's beyond ridiculous (time consuming, bandwidth consuming, and a crime against programmers everywhere).

Another possible solution I've been delving into is using git-upload-archive on the remote side instead, although I admit I haven't spent much time looking into it yet.

I'm more than willing to rewrite JGit if it comes to that, so please don't read this as "how do I make JGit do...". I just want to know if the protocol itself is even capable of this. I feel like there's some wonderfully clever way to abuse the protocol to achieve what I want. Any thoughts?

Chris Eberle asked Jan 18 '13 19:01



1 Answer

Answering my own question. I found an acceptable (although barely documented) answer. I had to dig through a LOT of C code to figure this out.

First of all, the above requirements can't be achieved using git-upload-pack because that's simply not what the program was designed to do. The correct answer, as I suspected, is git-upload-archive. Sadly the protocol is hardly documented at ALL. So here are my notes on it in case anyone else has similar requirements.

Basically what I'm trying to simulate here (in scala) is the following command:

git archive --format=tar --remote=ssh://[email protected]/cornballer.git \
  master plans/documents/cornballer-blueprint.pdf | tar -x

Except in software, hopefully using JGit. Sadly JGit doesn't (yet) support git archive commands. So here's a very high-level overview of how to add support (I may fork JGit and add this at a later time).

Let's look at the protocol (from Documentation/technical/pack-protocol.txt):

git-proto-request = request-command SP pathname NUL [ host-parameter NUL ]
request-command   = "git-upload-pack" / "git-receive-pack" /
                    "git-upload-archive"   ; case sensitive
pathname          = *( %x01-ff ) ; exclude NUL
host-parameter    = "host=" hostname [ ":" port ]

So part one of the protocol goes something like this:

  1. Establish a transport with the remote (either ssh and then run git-upload-archive or use the anonymous git protocol)
  2. Send git-upload-archive /cornballer.git\0host=ssh.mycompany.com\0 (as a packet line)

At this point the connection is established. Git may return an error if the command isn't supported or if there was any kind of problem. I haven't yet figured out how to check for this.

Next comes the undocumented part. We basically send command line arguments for git-archive over the wire. They're exactly the same as the git-archive command with one exception: they are all prefixed with argument[SPACE]. Each argument is written (at least in the reference implementation) as a separate packet line. So for the above example:

  1. Send argument --format=tar (as a packet line)
  2. Send argument master (as a packet line)
  3. Send argument plans/documents/cornballer-blueprint.pdf (as a packet line)
  4. Send a flush packet (0000)
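
The four steps above can be sketched in Python as follows. `archive_arguments` is a hypothetical helper name. The trailing newline on each argument is an assumption based on the usual pkt-line convention for text lines, not something stated in the notes above.

```python
from typing import Iterable

FLUSH_PKT = b"0000"  # a pkt-line with length 0 marks the end of the request


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line (4 hex digits of total length + payload)."""
    return b"%04x" % (len(payload) + 4) + payload


def archive_arguments(args: Iterable[str]) -> bytes:
    """Encode git-archive arguments, one 'argument <value>' pkt-line each, then flush."""
    # Assumption: each line ends with "\n", as text pkt-lines usually do.
    lines = [pkt_line(b"argument " + arg.encode() + b"\n") for arg in args]
    return b"".join(lines) + FLUSH_PKT


payload = archive_arguments(
    ["--format=tar", "master", "plans/documents/cornballer-blueprint.pdf"]
)
```

Sending `payload` right after the initial request would hand the remote side the same command line as the `git archive` example.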

At this point we've given the remote git-archive process the entire command. Now we read the response. We read one packet line back from the server, which will be one of the following responses:

  1. ACK (meaning success -- ready to send the archive)
  2. NACK [message] -- some kind of error, only found one instance of its use -- "unable to spawn subprocess"
  3. ERR [message] -- an error occurred
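
Reading that first response could look like the Python sketch below. `read_pkt_line` and `parse_archive_response` are illustrative names, and the stream is assumed to be a blocking binary file object (for example, what `socket.makefile("rb")` returns), so that `read(n)` only comes back short at end of stream.

```python
import io
from typing import BinaryIO, Optional, Tuple


def read_pkt_line(stream: BinaryIO) -> Optional[bytes]:
    """Read one pkt-line; return its payload, or None for a flush packet (0000)."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended while reading a pkt-line length")
    length = int(header.decode("ascii"), 16)
    if length == 0:
        return None  # flush packet
    if length < 4:
        raise ValueError("invalid pkt-line length: %d" % length)
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("stream ended in the middle of a pkt-line")
    return payload


def parse_archive_response(payload: Optional[bytes]) -> Tuple[str, str]:
    """Split the first response into (status, message); status is ACK, NACK or ERR."""
    if payload is None:
        raise ValueError("expected ACK, NACK or ERR but got a flush packet")
    text = payload.decode("utf-8", "replace").rstrip("\n")
    status, _, message = text.partition(" ")
    if status not in ("ACK", "NACK", "ERR"):
        raise ValueError("unexpected response: %r" % text)
    return status, message


# Simulated server reply: ACK followed by the flush packet.
reply = io.BytesIO(b"0008ACK\n0000")
status, message = parse_archive_response(read_pkt_line(reply))
```

Anything other than `ACK` means the archive is not coming, and `message` carries whatever text the server sent along.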

If an ACK is sent, it will be followed by a flush packet (0000) and then the raw tar data. At this point you repeatedly read packet lines coming in on sideband #1 (the main data channel). When you reach a flush packet, you stop reading. Pretty simple.
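
A rough Python sketch of that read loop, assuming the stream is already positioned just past the ACK and its flush packet. In git's side-band scheme the first byte of each packet payload is the band number; treating band 2 as progress text and band 3 as a fatal error follows git's side-band convention and is not something covered in the notes above. `read_archive` is a made-up name.

```python
import io
from typing import BinaryIO, Optional


def read_pkt_line(stream: BinaryIO) -> Optional[bytes]:
    """Read one pkt-line; return its payload, or None for a flush packet (0000)."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended while reading a pkt-line length")
    length = int(header.decode("ascii"), 16)
    if length == 0:
        return None
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("stream ended in the middle of a pkt-line")
    return payload


def read_archive(stream: BinaryIO) -> bytes:
    """Concatenate sideband #1 payloads until the closing flush packet."""
    data = bytearray()
    while True:
        pkt = read_pkt_line(stream)
        if pkt is None:  # flush packet: the archive is complete
            return bytes(data)
        if not pkt:  # empty packet, nothing to demultiplex
            continue
        band, body = pkt[0], pkt[1:]
        if band == 1:
            data += body  # main data channel: tar bytes
        elif band == 2:
            pass  # progress text; ignored in this sketch
        elif band == 3:
            raise RuntimeError("remote error: " + body.decode("utf-8", "replace"))
        else:
            raise ValueError("unknown sideband %d" % band)
```

The whole archive is buffered in memory here, which is fine for a single file; a larger archive would be better streamed to disk.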

So now you have the remote file, but what if you wanted to do some kind of clever caching? One reason I was so gung-ho on using git-upload-pack is that it would let me record the commit ID and thus cache it locally and only refresh as needed. A tar file doesn't tell us that info, right? Wrong!

From the man page of git-archive:

Additionally the commit ID is stored in a global extended pax header if the tar format is used; it can be extracted using git get-tar-commit-id. In ZIP files it is stored as a file comment.

Well that's great news! That's literally everything I wanted. In case you're wondering what the header looks like, here's a sample (no I'm not going to dissect pax headers):

pax_global_header00006660000000000000000000000064121002672560014513gustar00rootroot0000000000000052 comment=326756f834865880c9832b64238e7665632e9b67
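
Rather than dissecting the header by hand, a tar library that understands pax global headers can pull the commit ID out. Below is a small Python sketch using the standard `tarfile` module, whose `TarFile.pax_headers` dictionary holds the global header fields; `tar_commit_id` is an illustrative name, and the in-memory archive is a stand-in for real `git archive` output.

```python
import io
import tarfile
from typing import Optional


def tar_commit_id(tar_bytes: bytes) -> Optional[str]:
    """Return the 'comment' field of the pax global header, or None if absent."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        tf.getmembers()  # make sure all headers have been read
        return tf.pax_headers.get("comment")


# Build a stand-in archive with a global 'comment' header, like git archive does.
sample_id = "326756f834865880c9832b64238e7665632e9b67"
buf = io.BytesIO()
with tarfile.open(
    fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
    pax_headers={"comment": sample_id},
) as tf:
    info = tarfile.TarInfo("file.txt")
    info.size = 5
    tf.addfile(info, io.BytesIO(b"hello"))

commit_id = tar_commit_id(buf.getvalue())
```

Comparing `commit_id` with a previously cached value is enough to decide whether the local copy needs refreshing.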

So from my perspective, I simply need to set up a pipeline that automatically runs the above steps and then passes the result through an untar step (programmatically) to get the desired "fetch a single file from git" functionality.
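
The untar step can be done entirely in memory. Here is a minimal Python sketch with the standard `tarfile` module; `extract_file` is a hypothetical helper, and the archive built at the bottom only stands in for the bytes received from the remote.

```python
import io
import tarfile


def extract_file(tar_bytes: bytes, path: str) -> bytes:
    """Return the contents of one regular file from an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        member = tf.getmember(path)  # raises KeyError if the path is not present
        handle = tf.extractfile(member)
        if handle is None:  # directories and other non-file entries
            raise ValueError("%s is not a regular file" % path)
        return handle.read()


# Stand-in for the received archive: one file under plans/documents/.
wanted = "plans/documents/cornballer-blueprint.pdf"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo(wanted)
    info.size = 4
    tf.addfile(info, io.BytesIO(b"%PDF"))

contents = extract_file(buf.getvalue(), wanted)
```

Reading the member directly avoids writing anything to disk, so there is no extraction path to sanitize.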

Chris Eberle answered Oct 16 '22 11:10