Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to substitute words in Git history & properly debug related problems?

I'm trying to remove sensitive data like passwords from my Git history. Instead of deleting whole files I just want to substitute the passwords with removedSensitiveInfo. This is what I came up with after browsing through numerous StackOverflow topics and other sites.

git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"

When I run this command it seems to be rewriting the history (it shows the commits it's rewriting and takes a few minutes). However, when I check to see if all sensitive data has indeed been removed it turns out it's still there.

For reference this is how I do the check

git grep aSecretPassword1 $(git rev-list --all)

Which shows me all the hundreds of commits that match the search query. Nothing has been substituted.

Any idea what's going on here?

I double checked the regular expression I'm using which seems to be correct. I'm not sure what else to check for or how to properly debug this as my Git knowledge quite rudimentary. For example I don't know how to test whether 1) my regular expression isn't matching anything, 2) sed isn't being run on all files, 3) the file changes are not being saved, or 4) something else.

Any help is very much appreciated.

P.S. I'm aware of several StackOverflow threads about this topic. However, I couldn't find one that is about substituting words (rather than deleting files) in all (ASCII) files (rather than specifying a specific file or file type). Not sure whether that should make a difference, but all suggested solutions haven't worked for me.

like image 312
Marc Avatar asked Dec 12 '22 10:12

Marc


2 Answers

git-filter-branch is a powerful but difficult to use tool - there are several obscure things you need to know to use it correctly for your task, and each one is a possible cause for the problems you're seeing. So rather than immediately trying to debug them, let's take a step back and look at the original problem:

  • Substitute given strings (ie passwords) within all text files (without specifying a specific file/file-type)
  • Ensure that the updated Git history does not contain the old password text
  • Do the above as simply as possible

There is a tailor-made solution to this problem:

Use The BFG... not git-filter-branch

The BFG Repo-Cleaner is a simpler alternative to git-filter-branch specifically designed for removing passwords and other unwanted data from Git repository history.

Ways in which the BFG helps you in this situation:

  • The BFG is 10-720x faster
  • It automatically runs on all tags and references, unlike git-filter-branch - which only does that if you add the extraordinary --tag-name-filter cat -- --all command-line option (Note that the example command you gave in the Question DOES NOT have this, a possible cause of your problems)
  • The BFG doesn't generate any refs/original/ refs - so no need for you to perform an extra step to remove them
  • You can express you passwords as simple literal strings, without having to worry about getting regex-escaping right. The BFG can handle regex too, if you really need it.

Using the BFG

Carefully follow the usage steps - the core bit is just this command:

$ java -jar bfg.jar  --replace-text replacements.txt  my-repo.git

The replacements.txt file should contain all the substitutions you want to do, in a format like this (one entry per line - note the comments shouldn't be included):

PASSWORD1 # Replace literal string 'PASSWORD1' with '***REMOVED***' (default)
PASSWORD2==>examplePass         # replace with 'examplePass' instead
PASSWORD3==>                    # replace with the empty string
regex:password=\w+==>password=  # Replace, using a regex

Your entire repository history will be scanned, and all text files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.

Full disclosure: I'm the author of the BFG Repo-Cleaner.

like image 90
Roberto Tyley Avatar answered Mar 05 '23 15:03

Roberto Tyley


Looks OK. Remember that filter-branch retains the original commits under refs/original/, e.g.:

$ git commit -m 'add secret password, oops!'
[master edaf467] add secret password, oops!
 1 file changed, 4 insertions(+)
 create mode 100644 secret
$ git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
Rewrite edaf467960ade97ea03162ec89f11cae7c256e3d (2/2)
Ref 'refs/heads/master' was rewritten

Then:

$ git grep aSecretPassword `git rev-list --all`
edaf467960ade97ea03162ec89f11cae7c256e3d:secret:aSecretPassword2

but:

$ git lola
* e530e69 (HEAD, master) add secret password, oops!
| * edaf467 (refs/original/refs/heads/master) add secret password, oops!
|/  
* 7624023 Initial

(git lola is my alias for git log --graph --oneline --decorate --all). Yes, it's in there, but under the refs/original name space. Clear that out:

$ rm -rf .git/refs/original
$ git reflog expire --expire=now --all
$ git gc
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 0 (delta 0)

and then:

$ git grep aSecretPassword `git rev-list --all`
$ 

(as always, run filter-branch on a copy of the repo Just In Case; and then removing original refs, expiring the reflog "now", and gc'ing, means stuff is Really Gone).

like image 26
torek Avatar answered Mar 05 '23 15:03

torek