I'm going to convert a large Mercurial project to Git this weekend using fast-export. I've tested that several times, and the results are good.
We'd also like to convert our source code encoding (lots of German comments/string literals with umlauts) from ISO-8859-1 to UTF-8 (all other non-Java files in the repo should stay as-is), and the Git migration gives us a chance to do it now, since everybody needs to clone again anyway. However, I haven't found a good approach for it yet.

My first attempt was the `git filter-branch --tree-filter ...` approach from this comment on SO. While this seems ideal, due to the size of the repository (about 200,000 commits, 18,000 code files) it would take much more time than just the weekend I have. I tried running it (in a heavily optimized version where the list of files is chunked and the sublists are converted in parallel using GNU parallel) straight from a 64 GB tmpfs volume on a Linux VM with 72 cores, and it would still take several days.

My second attempt was to rewrite not the whole history (`--all` as the `<rev-list>`), but only the commits reachable from the currently active branches and not reachable from some past commit which is (hopefully) a predecessor of all current branches (`branch-a branch-b branch-c --not old-tag-before-branch-a-b-c-forked-off` as the `<rev-list>`). It's still running, but I fear I can't really trust the results, as this seems like a very bad idea.

So right now, I somehow feel the best solution could be to just stick with ISO-8859-1.
Does anyone have an idea? Someone mentioned that reposurgeon might be able to do basically approach 1 using its `transcode` operation, with much better performance than `git filter-branch --tree-filter ...`, but I have no clue how that works.
The conversion itself is a one-liner in Java:

```java
byte[] latin1 = ...;
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
```

You can exercise more control by using the lower-level `Charset` APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, …) as text files.

A tree filter in `git filter-branch` is inherently slow. It works by extracting every commit into a full-blown tree in a temporary directory, letting you change every file, and then figuring out what you changed and making the new commit from whatever files you left behind.
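To see why, here is roughly what the tree-filter approach from the question looks like. This is a sketch only: it assumes the `*.java` pattern and ISO-8859-1 encoding from the question, and file names without whitespace. Every single commit gets checked out to disk just so `iconv` can run over it:

```bash
# Sketch: rewrites all commits, extracting each tree to a temp directory.
# Assumes *.java files, ISO-8859-1 sources, no whitespace in paths.
git filter-branch --tree-filter '
  find . -name "*.java" | while read -r f; do
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp" \
      && mv "$f.tmp" "$f" \
      || rm -f "$f.tmp"   # never leave a partial .tmp file in the tree
  done
' -- --all
```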
If you're exporting and importing through fast-export / fast-import, that would be the time to convert the data: you have each file's contents in memory, but not in file-system form, before writing it to the export/import pipeline. Moreover, hg-fast-export's driver is a shell script that pipes the exporter's output into `git fast-import`, so it's easy to insert a filter into that pipeline, and the exporter itself (hg-fast-export.py) is a Python program, so it's just as easy to insert the filtering there. The obvious place is the spot where the exporter writes out each file's contents: just re-encode `d` (the variable holding the blob data) before it is emitted.
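In pipeline form, the idea looks like the sketch below. The `reencode-blobs` step is a hypothetical filter you would have to write yourself: a correct one must parse the stream's `data <byte-count>` headers and rewrite them, since re-encoding changes blob lengths, rather than blindly transforming the whole byte stream:

```bash
# Sketch only; the real hg-fast-export invocation takes more options, and
# reencode-blobs is a hypothetical filter that converts ISO-8859-1 *.java
# blobs to UTF-8 and fixes up each "data <byte-count>" header to match.
hg-fast-export.py -r /path/to/hg-repo | reencode-blobs | git fast-import
```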
I had the exact same problem, and my solution is based on @kostix's suggestion of using the `--index-filter` option of `git filter-branch` as the basis, but with some additional improvements.

It consists of:

- using `git diff --name-only --staged` to detect the contents of the staging area,
- filtering only those files for which `git ls-files $filename` returns something, i.e., skipping deleted files,
- checking via `git show ":0:$filename" | file - --brief --mime-encoding` that the staged content is not `binary` (i.e., it is a text file) and is not already UTF-8 encoded,
- using `git ls-files $filename --stage | cut -c 1-6` to read the file's original mode.

This is what my bash function looks like:
```bash
changeencoding() {
    # Note: assumes file names without whitespace
    for filename in $(git diff --name-only --staged); do
        # Only if the file is still present, i.e., filter out deletions
        if [ -n "$(git ls-files "$filename")" ]; then
            local encoding=$(git show ":0:$filename" | file - --brief --mime-encoding)
            # Skip binary files and files that are already UTF-8
            if [ "$encoding" != "binary" ] && [ "$encoding" != "utf-8" ]; then
                # Re-encode the staged blob and store it in the object
                # database; hash-object -w prints the new blob's SHA-1
                local sha1=$(git show ":0:$filename" \
                    | iconv --from-code="$encoding" --to-code=utf-8 \
                    | git hash-object -t blob -w --stdin)
                # Preserve the original file mode (e.g. 100644 or 100755)
                local mode=$(git ls-files "$filename" --stage | cut -c 1-6)
                # Point the index entry for $filename at the new blob
                git update-index --cacheinfo "$mode,$sha1,$filename" --info-only
            fi
        fi
    done
}
```
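To run this over the whole history, the function can be plugged into `git filter-branch` as an index filter. A minimal sketch, assuming the function is saved in a (hypothetical) file `changeencoding.sh`:

```bash
# Sketch: source the file defining changeencoding inside the filter.
# An index filter never checks out the work tree, which is what makes
# it so much faster than a tree filter.
git filter-branch --index-filter '
    . /path/to/changeencoding.sh
    changeencoding
' -- --all
```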