
Remove spam from git history

I have "inherited" a dirty git repository with about 5k valid commits and about 50k spam commits (this is the edit history for something that used to be a world-writable wiki). We're migrating formats, so this is a good time to rewrite history. I don't want to lose the history entirely, but both by commit volume and raw content volume the spam is overwhelming. The old moderation technique of rolling back to the last good commit left a lot of junk.

I can find about 80% of the bad commits without too much trouble using git log -S and some regular expression work. Most of the spam content is pretty obvious. The problem is I'm not sure what to do with the massive list of commits I want to drop.

Note I'm quite familiar with git and use git rebase hourly (that would have been minutely except git revise has taken over a lot of the load), and I know how to accomplish this manually, but I need an automated solution. Normally I would turn to git filter-branch, but I'm not sure what tool to reach for to inspect the current diff.

I thought about writing a script to manipulate a rebase script, but I think that's going to get me in trouble with false positives. I can probably catch and drop both the original defacing and the rollback, but what happens when I miss one side of that equation? I want the REST of the possible matches to succeed, not fail, when one of them doesn't rebase cleanly.

Note that I don't want to manipulate the contents of files or add/remove files based on my matches; I want to inspect the content of each patch and decide whether to pick or drop the commit based on that.

What's the best git tool to reach for?

Caleb asked Aug 13 '19 13:08




1 Answer

One possibility is to use Git's graftsfile or git replace. First, identify all "good" commits, i.e. the non-spam commits, including the "cleanup/revert" commits, for instance by filtering your history by committer email or a similar mechanism (you mentioned pickaxe/-S).
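As a rough sketch of that first step, one way to turn a search into a list of good commits is to collect the matching (bad) commits and subtract them from the full list. The helper name list_good_commits and the regex passed to it are placeholders, and it uses -G (match diff lines against a regex) rather than -S, which is my choice and not something from the question:

```shell
# Hypothetical helper: print every commit reachable from HEAD, newest first,
# except those whose diff adds or removes a line matching the given regex.
list_good_commits() {
    pattern=$1
    bad=$(mktemp)
    # Commits whose patch touches a line matching the pattern.
    git log --format=%H -G"$pattern" > "$bad"
    # Keep only the hashes that do not appear in the bad list.
    git rev-list HEAD | grep -vxFf "$bad"
    rm -f "$bad"
}
```

Running list_good_commits 'some spam regex' > commits inside the repository would produce the kind of list used below. Because -G also matches removed lines, a rollback that deletes the spam is filtered out along with the spam commit itself.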

Once you have the list of "good" commits, a simple transformation with the paste command gives you the content of the graftsfile, in which each line has the form:

commit parent1 parent2 parent3...

Say, your good commits are as follows (newest commits on top):

b3fb1155cd5352da674d93ce4b0a1567674f6d27
b460ef0aea564e587e5866107c0fc52adf552ca1
9f803dd18c89e13f47170e1ace1d0abb992cfeee

then you need the following content in your graftsfile:

b3fb1155cd5352da674d93ce4b0a1567674f6d27 b460ef0aea564e587e5866107c0fc52adf552ca1
b460ef0aea564e587e5866107c0fc52adf552ca1 9f803dd18c89e13f47170e1ace1d0abb992cfeee

This is fairly easy to obtain, assuming the list is stored in a file named commits:

sed 1d commits | paste commits - | sed '$d'
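To see what the pipeline does, here is a small worked run on the three sample hashes from above. The file names commits and grafts are just working names, and the -d ' ' option is my addition so the output is space-separated like the listing above (paste uses a tab by default):

```shell
# Sample list of good commits, newest first (hashes from the example above).
cat > commits <<'EOF'
b3fb1155cd5352da674d93ce4b0a1567674f6d27
b460ef0aea564e587e5866107c0fc52adf552ca1
9f803dd18c89e13f47170e1ace1d0abb992cfeee
EOF

# "sed 1d" shifts the list up by one line, so paste puts each commit next to
# the commit listed after it (its new parent). The final "sed '$d'" drops the
# last line, where the oldest commit has no partner.
sed 1d commits | paste -d ' ' commits - | sed '$d' > grafts

cat grafts
```

The oldest commit in the list gets no line of its own, so it keeps whatever parent it already has.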

Move this file to .git/info/grafts and verify the resulting history with git log or gitk. If you are satisfied with the result, use git filter-branch to rewrite the history and persist your graftsfile. You can then remove .git/info/grafts.
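Put together, those steps might look roughly like the following. This is only a sketch: apply_grafts is a made-up helper name, it assumes you run it from the top level of a non-bare repository, and it rewrites all refs, so try it on a throwaway clone first:

```shell
# Hypothetical helper: make the parent links from a grafts-format file permanent.
apply_grafts() {
    grafts_file=$1
    mkdir -p .git/info
    cp "$grafts_file" .git/info/grafts

    # Show the grafted history. In practice, stop here and inspect it
    # (git log, gitk) before going any further.
    git --no-pager log --oneline --graph

    # Rewrite every ref so the grafted parents become real parents.
    # The variable only silences filter-branch's startup notice and delay.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -- --all

    # The rewritten commits no longer need the graft entries.
    rm .git/info/grafts
}
```

Keep in mind that grafting only changes parent pointers: every kept commit retains its original snapshot, so spam that was never rolled back before a good commit is still present in that commit's tree. git filter-branch also leaves the pre-rewrite refs under refs/original/ as a backup.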

See https://stackoverflow.com/a/3811217/112968 for how to use the non-deprecated replace mechanism. Using the graftsfile is easier to explain in this situation (and it still works with current Git versions, so why not use it? :))
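For comparison, a minimal sketch of the replace-based variant, assuming the commit/parent pairs are in the same one-line-per-commit format as the graftsfile. The helper name graft_with_replace is made up for illustration:

```shell
# Hypothetical helper: read "commit parent..." lines and record each one
# as a replacement with git replace --graft instead of .git/info/grafts.
graft_with_replace() {
    while read -r commit parents; do
        # $parents is unquoted on purpose: a line may list several parents.
        git replace --graft "$commit" $parents
    done < "$1"
}
```

After graft_with_replace grafts, git log shows the shortened history, and git replace -l lists the replacements that were created. Since git filter-branch also honors refs/replace/, the same rewrite step makes the result permanent, after which the replacements can be deleted with git replace -d.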

knittl answered Dec 27 '22 04:12
