
Accidentally committed sensitive information - GitLab

Tags:

git

gitlab

I accidentally committed a file with sensitive data. I need to update that file by removing the sensitive data and ensure the older version doesn't show up in the history.

I understand that those who have the repo cloned locally will still have access to it. But once they pull the latest, can it be setup in a way that they will not see the sensitive data moving forward or will not be able to see it in the logs?

DotnetDude asked Dec 11 '22 04:12


2 Answers

While GitLab is not generally as public as GitHub, the general rules about data apply here: if you've given sensitive / secret data to someone who cannot be trusted, your secret is already out and you should stop depending on it.

That means the key question is not—or at least, not yet—"how do I convince GitLab to forget my secrets" but rather "do I completely, totally trust both the GitLab server(s) and everyone else that has had access to those server(s) all this time?" If the answer is "no" you must stop depending on this secret anyway.

That said, there are rules about how Git itself stores the data. Assuming your GitLab server(s) use only Git (and not some additional layers built atop it that may add yet more ways for your sensitive / secret data to leak), all you have to do is convince the GitLab servers to do the same thing you would do in your own Git.

Git's underlying storage model is that a repository is a collection of what Git calls objects. Each object has a unique hash ID, and is one of four types: blob, tree, commit and annotated tag. A blob is, roughly, file data. If the sensitive / secret data are inside a file, they are actually inside a blob object. A tree pairs up (well, more than pairs, but let's use that for now [1]) each file's name with its blob hash ID, so if the file's name is the sensitive / secret data, your secret is actually inside a tree object. A commit object contains your name, email address, time stamp, log message, and the hash ID of some previous or parent commit(s), along with the hash ID of the tree that holds the files that make up the snapshot that is that commit. An annotated tag object holds much the same as a commit except that instead of a tree object, it usually has the hash ID of a commit; this is where one usually stores a PGP signature marking some particular commit as "blessed" and, say, called version 2.3.4 or whatever.
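To make the object model a bit more concrete, here is a small sketch showing that a blob's hash ID is derived purely from the file's content. It assumes the default SHA-1 object format and that sha1sum (GNU coreutils) is installed; the sample content is arbitrary.

```shell
# Ask Git which ID it would give a blob holding "hello\n" (nothing is stored).
blob_id=$(printf 'hello\n' | git hash-object --stdin)

# Recompute it by hand: SHA-1 over the header "blob <size>", a NUL byte,
# and then the content itself.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "$blob_id"
echo "$manual_id"
```

Both lines print the same ID, which is why two files with identical content share one blob, and why a secret stays findable for as long as any tree still lists that blob's ID.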

Assuming your secrets are in one particular file, whose name itself is not secret, your goal at this point is to cause your Git to stop using the blob that holds that particular file's data. To do so, you must cause the object itself to become unreferenced, and then use git gc to make Git physically remove the unreferenced object. At this point, a long aside into reachability in general is useful, but I'll outsource it to Think Like (a) Git. Let's just say here that in general, right after you've accidentally committed some secret file, the way that Git finds the commit object is using a branch name:

... <-F <-G <-H   <--master

The name master contains the hash ID of commit H. Commit H contains the hash ID of its parent commit, commit G, so for Git to find commit G, it starts by reading the name master (which produces hash ID H) and then reading the commit object from the database (which produces one tree object and one parent commit hash, G, along with the log message and your name and email address and so on), throws out all but the hash of G, and then reads the actual commit object G from the database. If you have asked Git to get some particular file—or more precisely, that file's content—from commit G, it then uses G's tree to find the hash ID of the blob containing that file, then gets the blob object from the database, and now Git has the content.
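The same walk can be done by hand. This is only a sketch in a throwaway repository; the file name notes.txt and the commit messages are made up for the illustration.

```shell
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master      # name the branch as in the text
git config user.name "Example" && git config user.email "example@example.com"

printf 'first\n' > notes.txt
git add notes.txt && git commit -q -m "G"
printf 'second\n' > notes.txt
git commit -q -am "H"

tip=$(git rev-parse master)                  # hash ID stored under the name master (H)
git cat-file -p "$tip"                       # H's tree, parent, author and log message
parent=$(git rev-parse "$tip^")              # the parent hash recorded inside H (G)
blob=$(git rev-parse "$parent:notes.txt")    # G's tree maps the name to a blob hash ID
content=$(git cat-file -p "$blob")           # the blob object holds the file's content
echo "$content"
```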

So, suppose your secret data are in a blob attached to a tree attached to commit H, and those same data are not in any other file—so that no tree attached to any other commit will have the hash ID of that blob. Then, to make H itself unreferenced, just make the name master point to G instead of H:

git checkout master
git reset --hard HEAD~1

Now you have:

...--E--F--G   <-- master
            \
             H   [abandoned]

But while H has no obvious name holding its hash ID, we're not yet done: git gc won't—at least not yet—remove H, and here's where things start to get complicated.

If there are valuable files in H, we can instead push H aside with git commit --amend (run in place of the reset, while master still points to H). This makes a new commit I whose parent is G, the same parent H has, and makes master point to I:

... edit files, git add, git commit --amend ...

giving:

             H   [abandoned]
            /
...--E--F--G--I   <-- master

[1] Technically, each tree entry has:

  • the entry's mode, a text string like 100755 or 100644. The string is 40000 if the entry is for a sub-tree.
  • a string of bytes holding the file's name, generally in UTF-8 encoding
  • the hash ID that goes with the entry

(The mode and name are separated by a space, and the name is terminated by an ASCII NUL, while the hash ID is encoded in 20 binary bytes. This is going to have to change when Git switches to SHA-256. I don't think the new format is as-yet decided, but it could be as simple as, say, using a mode of 0n where n is a version number, as the mode is in octal with leading zeros suppressed, so no existing tree will have 01 as a mode. Or, perhaps it might be a NUL byte followed by a version number, since that too is currently an invalid tree entry.) Hence for sub-directories, the tree just lists sub-trees, and for regular files there are two values plus a hash. For symlinks, the hash ID is still that of a blob, but the blob's content is the target of the symbolic link; and for gitlinks for submodules, the hash ID is that of the commit Git should git checkout in the submodule.
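To look at tree entries without decoding the binary form, git ls-tree prints them one per line. A minimal sketch in a scratch repository (the file and directory names are invented):

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Example" && git config user.email "example@example.com"

mkdir docs
printf 'readme\n' > README
printf 'guide\n' > docs/guide.txt
git add . && git commit -q -m "snapshot"

# One line per entry: mode, object type, hash ID, then the name.
# ls-tree pads the sub-tree mode to 040000; the stored string is 40000.
entries=$(git ls-tree HEAD)
echo "$entries"
```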


The main complication is reflogs

The part of Git that does remember H for you, even after you git reset it away, is what Git calls reflogs. A reflog remembers the previous values of a reference. That is, the branch name master might point to H right now, before we git reset it. Then it points to G or I right now, after we use git reset --hard or git commit --amend to discard commit H. But it used to point to H, so H's hash ID is in the reflog for the name master.

The @{1} or @{yesterday} syntax is how you tell Git to look up these reflog values. Writing master@{1} tells your Git: look in my master reflog, and get me the immediately-previous value of master. The fact that this entry exists will make your Git retain commit H which will make your Git retain the blob containing the secret.

There are in fact at least two reflogs containing the hash ID of commit H: one for master, in master@{1}, and one for HEAD itself. So if you are to convince your Git to really discard commit H, and hence discard the tree for H, and hence discard any blobs unique to the tree for H, you must make these reflog entries go away.
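Here is a rough demonstration that the reflog keeps H, and so the secret blob, alive even across an ordinary git gc. Everything runs in a throwaway repository; secret.txt and its content are placeholders.

```shell
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"

printf 'app\n' > app.txt
git add app.txt && git commit -q -m "G"
printf 'hunter2\n' > secret.txt
git add secret.txt && git commit -q -m "H: oops"
h=$(git rev-parse HEAD)
secret_blob=$(git rev-parse HEAD:secret.txt)

git reset -q --hard HEAD~1                   # master now names G
git gc -q                                    # an ordinary gc honors the reflogs

git reflog show master                       # lists the earlier values of master
previous=$(git rev-parse 'master@{1}')       # the value master had before the reset
leaked=$(git cat-file -p "$secret_blob")     # the blob is still readable
echo "$leaked"
```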

Normally, they go away on their own, generally after about 30 days. This happens because each reflog entry has a time-stamp as well, and git reflog expire will expire (and remove) old reflog entries based on this time-stamp, compared with the current time on your computer. The overall housekeeping command, git gc, runs git reflog expire for you, and sets it up to expire unreachable commits [2] in 30 days by default. (Reachable commits get 90 days by default.) So on your own Git, you would need to run:

git reflog expire --expire-unreachable=now --all

to tell your Git: Find all unreachable commits like H and expire their reflog entries now.
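As a sketch of the effect, the following counts the reflog entries that name H before and after the expiry. The repository is a scratch one, and count_h is a helper written just for this example.

```shell
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"

printf 'app\n' > app.txt
git add app.txt && git commit -q -m "G"
printf 'hunter2\n' > secret.txt
git add secret.txt && git commit -q -m "H: oops"
h=$(git rev-parse HEAD)
git reset -q --hard HEAD~1

# Hypothetical helper: how many entries in the master and HEAD reflogs name H?
count_h() {
  { git reflog show --format=%H master
    git reflog show --format=%H HEAD; } | grep -c "$h" || true
}

before=$(count_h)
git reflog expire --expire-unreachable=now --all
after=$(count_h)
echo "before=$before after=$after"
```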


[2] Technically, the test is whether the commit is unreachable from the current value of the reference. That is, Git does not test global reachability here; it does a somewhat simpler test: does this reflog entry point to a commit that is an ancestor of the commit to which the reference itself points right now?


The secondary complication is the object-prune grace time

Even after expiring the reflog entries from both HEAD and the branch name, you'll find that your own git gc does not immediately discard the blob object. The reason is that all Git objects have a grace period during which git gc won't prune them away. The default grace period is 14 days. This gives all Git commands some time during which they can create objects without worrying about them, as long as they finish all their work within that 14 day period by linking all those objects up into a commit or tag object or whatever, and making an appropriate reference name (such as a branch or tag name) record the hash ID of that object.

To make the blob you accidentally committed with H go away, then, you not only need to expire the unreachable reflog entries, but also tell Git to prune objects even if they're zero days old:

git prune --expire=now

This prune step is the part of git gc that actually removes the object, so by running git prune, you remove the need to run git gc. (git gc also runs the reflog expire and so on, but coordinates everything to make sure Git has these grace periods. Since we're bypassing all the grace periods, we just bypass git gc as well.)

Make sure no other Git commands are running when you do this, since they may be creating objects that they expect to persist for 14 days while they get their work done.
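Put together, the whole local sequence might look like the sketch below. It runs in a throwaway repository where the secret is still a loose object; the file names and the secret value are placeholders.

```shell
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"

printf 'app\n' > app.txt
git add app.txt && git commit -q -m "G"
printf 'hunter2\n' > secret.txt
git add secret.txt && git commit -q -m "H: oops"
secret_blob=$(git rev-parse HEAD:secret.txt)

git reset -q --hard HEAD~1                          # drop H from master
git reflog expire --expire-unreachable=now --all    # forget that master and HEAD ever named H
git prune --expire=now                              # remove unreachable loose objects, no grace period

# cat-file -e exits non-zero when the object is not in the repository.
if git cat-file -e "$secret_blob" 2>/dev/null; then status=present; else status=gone; fi
echo "$status"
```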

The last complication is pack files

If your secret is stored in what Git calls a loose object, the above steps suffice: the object will be completely gone, and:

git cat-file -t <hash-ID>

will no longer find the object at all. It's no longer available anywhere in this Git repository.

But not all objects are loose. Eventually, to save space, Git packs these loose objects into pack files. Objects stored inside pack files get compressed against other objects in the same pack file. [3] In this case, if your secret data have become packed, it's still possible to retrieve them from the pack file.

This usually doesn't happen quickly so it's rare to have a just-committed secret wind up in a pack file. But if it has happened, the only way to clean it up is to make Git re-pack all the existing pack files. That is, you would have Git explode the packs into their constituent loose objects, then toss the unwanted object, then build a new (usually single) pack file—or use a process that has that effect, at least. The Git command to rebuild the packs is git repack and it has a lot of options. I'm not going to go into any more detail here as I'm out of time.
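For what it's worth, here is one possible way to see the packed case and clear it. This is a sketch rather than a full treatment of git repack: it assumes that running repack with -a -d is acceptable for the repository, and it uses a scratch repository with placeholder names.

```shell
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"

printf 'app\n' > app.txt
git add app.txt && git commit -q -m "G"
printf 'hunter2\n' > secret.txt
git add secret.txt && git commit -q -m "H: oops"
secret_blob=$(git rev-parse HEAD:secret.txt)

git gc -q                                           # packs everything, the secret included
git reset -q --hard HEAD~1
git reflog expire --expire-unreachable=now --all
git prune --expire=now                              # only deals with loose objects
if git cat-file -e "$secret_blob" 2>/dev/null; then after_prune=present; else after_prune=gone; fi

git repack -a -d -q                                 # write one new pack of reachable objects, delete the old pack
if git cat-file -e "$secret_blob" 2>/dev/null; then after_repack=present; else after_repack=gone; fi
echo "after prune: $after_prune, after repack: $after_repack"
```

The -a option packs everything that is still referenced into a single new pack, and -d deletes the packs made redundant by it, so an object nothing refers to is simply not carried over.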


[3] In thin packs, objects may be compressed against other objects in the repository that are not in the pack file, but thin packs are used only for fetch and push operations, after which they're "fattened up" by adding the missing bases back.


Servers often don't have reflogs

To deal with all of this, you need to be able to log into your GitLab server(s), as none of these maintenance Git commands (nor the BFG, see below) can be invoked via fetch or push. In particular, while you can use git push -f from your client to make the name master on the server no longer point to commit H, you cannot invoke git prune to make a loose object go away.

If and when you do log into the server, you can check whether reflogs are enabled for your repository there. If not, there's no need to do any reflog expiry. You can also see whether your object is loose or packed by looking into the .git/objects directory (in a bare repository, which is how servers usually store them, this is just objects at the top level). If your blob hash ID is, say, 0123456789... and the object is loose, it will live in a file named .git/objects/01/23456789.... Once it's unreferenced and pruned, the file will be gone and you will be done.
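A quick way to run that check is to build the path from the hash ID. The sketch below stores a stand-in blob in a scratch, non-bare repository; on a bare server repository the .git/ prefix would be dropped.

```shell
cd "$(mktemp -d)" && git init -q

# Store a stand-in "secret" as a loose blob and note its hash ID.
blob=$(printf 'hunter2\n' | git hash-object -w --stdin)

# The first two hex digits name the directory, the rest name the file.
dir=$(printf '%s' "$blob" | cut -c1-2)
file=$(printf '%s' "$blob" | cut -c3-)
loose_path=".git/objects/$dir/$file"

if [ -f "$loose_path" ]; then echo "loose: $loose_path"; else echo "not loose (packed or absent)"; fi

# Summary: "count" is the number of loose objects, "in-pack" the packed ones.
git count-objects -v
```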

Using The BFG repo cleaner

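You can avoid a lot of complications by using the BFG Repo-Cleaner, which rewrites every affected commit and ref for you. BFG itself does not delete the old objects; its documentation has you follow it with git reflog expire --expire=now --all and git gc --prune=now --aggressive, which skips the grace periods and repacks, so that also takes care of any pack file issues. As with the other method, the cleanup must happen on the server, and it has its own quirks (see the linked question and answers).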

torek answered Dec 14 '22 23:12


You could remove the sensitive data from history. As you note, any existing clone that has pulled the current history will still have the file. Those repos will have to be "fixed" to keep working with the remote (see the git rebase docs - https://git-scm.com/docs/git-rebase - under "Recovering from Upstream Rebase"). Even after the repair, the users of those repos will still be able to get at the data if they want to. (In fact, nothing would stop them from making a copy of that data before the repair, even if you did somehow have a repair process that would forcibly remove the data from their clone.)

With that in mind, you really just need to treat that data as compromised. For example, if it is a password, change the password.

And with that in mind, it's possible that a history rewrite may not be worth it. If sensitive data is of such a type that it can't be changed and all you can do is mitigate the existing leak and try to prevent it spreading further, then the history edit has value in that it keeps new clones from further exposing the data. But if it's a password, then changing the password makes it irrelevant whether the old password remains in the source history, so then it's probably not worth fixing.

If you are going to rewrite history, there are several tools you could use, depending on how much history is affected. Detailed procedures for all of these have been discussed here numerous times, but in summary:

  • If it's just the most recent commit(s) of a ref or two, then you could use git commit --amend

  • If it's a simple linear history of commits (and probably not a terribly long history), you could do an interactive rebase to edit the commit that introduced the sensitive data

  • For more complicated cases where the history isn't prohibitively large, you could use git filter-branch with either a --tree-filter or an --index-filter

  • There are specialized tools you could use, like the BFG Repo cleaner.
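As one illustration of the filter-branch option, the sketch below drops a file from every commit in a scratch repository. The file name secret.txt and the three-commit history are invented, and FILTER_BRANCH_SQUELCH_WARNING is set only to skip the warning pause that newer Git versions add.

```shell
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"

printf 'v1\n' > app.txt
git add app.txt && git commit -q -m "A"
printf 'hunter2\n' > secret.txt
printf 'v2\n' > app.txt
git add . && git commit -q -m "B: adds the secret"
printf 'v3\n' > app.txt
git commit -q -am "C"

# Rewrite every commit on every ref, removing secret.txt from each snapshot.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm -q --cached --ignore-unmatch secret.txt' -- --all >/dev/null

touching=$(git log --format=%H master -- secret.txt)   # commits on master that touch the file
commits=$(git rev-list --count master)                 # the history keeps its length
echo "commits=$commits touching='$touching'"
```

Note that filter-branch keeps the old commits under refs/original/ (and the reflogs still name them), so the rewritten history alone does not remove the data from this repository.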

Mark Adelsberger answered Dec 15 '22 01:12