
Git gets confused with ä in file name

I'm in a bad git situation because of a filename with an ä. It's an old file that has probably been there for ages.

So it's marked as untracked with \303\244 but then if I remove it, it's instead marked as deleted, but with \314\210. Very confusing. I don't really care about the file, but want to know for the future…

~/d/p/uniply ❯❯❯ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        "deployment/ec2/Prods\303\244ttning"

nothing added to commit but untracked files present (use "git add" to track)
~/d/p/uniply ❯❯❯ rm deployment/ec2/Prodsättning
~/d/p/uniply ❯❯❯ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        deleted:    "deployment/ec2/Prodsa\314\210ttning"

no changes added to commit (use "git add" and/or "git commit -a")
~/d/p/uniply ❯❯❯ git checkout -- deployment/ec2
~/d/p/uniply ❯❯❯ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        "deployment/ec2/Prods\303\244ttning"

nothing added to commit but untracked files present (use "git add" to track)
Viktor Hedefalk asked Mar 10 '23 02:03



1 Answer

Short version: You’re clearly using a Mac, which converts all filenames to NFD. Git used to treat filenames as opaque bytes, but on a Mac it now converts them to NFC for better compatibility with other systems. As a result, old paths that were committed in NFD will behave strangely.

$ python3
>>> import unicodedata
>>> unicodedata.normalize('NFC', b'a\314\210'.decode()).encode()
b'\xc3\xa4'
>>> unicodedata.normalize('NFD', b'\303\244'.decode()).encode()
b'a\xcc\x88'

The full names for these formats are Normalization Form D (Canonical Decomposition) and Normalization Form C (Canonical Decomposition, followed by Canonical Composition), and they are defined in UAX #15.
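To make the difference concrete, here is a small sketch that lists the code points behind each form. The helper name code_points is made up for this illustration; only the standard unicodedata module is used.

```python
import unicodedata

def code_points(text):
    """Return 'U+XXXX NAME' for each code point in text (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

composed = unicodedata.normalize("NFC", "\u00e4")    # one code point
decomposed = unicodedata.normalize("NFD", "\u00e4")  # base letter + combining mark

print(code_points(composed))
# ['U+00E4 LATIN SMALL LETTER A WITH DIAERESIS']
print(code_points(decomposed))
# ['U+0061 LATIN SMALL LETTER A', 'U+0308 COMBINING DIAERESIS']
print(composed == decomposed)
# False: the strings render identically but compare unequal
```

The two strings look the same on screen, which is why the difference only shows up once git prints the bytes as octal escapes.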

Similar things can happen on case-insensitive filesystems — try checking out the Linux kernel tree on a Windows or Mac! — with the exception that you might expect to find a few repos containing both Makefile and makefile, but nobody in their right mind would check in files named both a\314\210 and \303\244, at least not deliberately.

The core problem is that the operating system makes the same file appear under different names, so what git sees depends on which name it asks for: a lookup by any equivalent spelling succeeds, but a directory listing only returns the spelling that the operating system prefers.

Here’s how that path would behave today, starting fresh:

$ git init 
Initialized empty Git repository
$ git config --get core.precomposeUnicode
true  # this is the default in git 1.8.5 and higher
$ touch Prodsättning 
$ env -u LANG /bin/ls -b
Prodsa\314\210ttning
$ git status -s 
?? "Prods\303\244ttning"

By using ls in the C locale, I can see the bytes in the filename, which contains the decomposed values. Git, however, composes the character into a single code point, so that users on different platforms will not produce different results. The patch that introduced precomposed unicode explains in detail what happens for various git commands.
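If you want to check which form a path from git status is in, the quoted name can be decoded by hand. The following is only a sketch: unquote_git_path and normalization_forms are hypothetical helpers, not part of git, and the parser assumes well-formed input in which every octal escape has exactly three digits, as in the output shown above.

```python
import unicodedata

# Single-character escapes used by C-style quoting, mapped to byte values.
_SIMPLE_ESCAPES = {"a": 7, "b": 8, "f": 12, "n": 10, "r": 13,
                   "t": 9, "v": 11, '"': 34, "\\": 92}

def unquote_git_path(quoted):
    """Turn a path as printed by git status into raw bytes (hypothetical helper)."""
    if not (len(quoted) >= 2 and quoted[0] == '"' and quoted[-1] == '"'):
        return quoted.encode("utf-8")  # git only quotes "unusual" paths
    body = quoted[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
        elif body[i + 1] in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[body[i + 1]])
            i += 2
        else:
            out.append(int(body[i + 1:i + 4], 8))  # three octal digits
            i += 4
    return bytes(out)

def normalization_forms(raw):
    """Return which of NFC/NFD the UTF-8 path is already in."""
    text = raw.decode("utf-8")
    return [form for form in ("NFC", "NFD")
            if unicodedata.normalize(form, text) == text]

untracked = unquote_git_path(r'"deployment/ec2/Prods\303\244ttning"')
deleted = unquote_git_path(r'"deployment/ec2/Prodsa\314\210ttning"')
print(normalization_forms(untracked))  # ['NFC']
print(normalization_forms(deleted))    # ['NFD']
```

A pure-ASCII path is reported as being in both forms, since normalization leaves it unchanged.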

If two files in a commit have the same name up to Unicode normalization (or case folding), then they will appear to "fight" when git checks out the files:

$ git clone https://gist.github.com/jleedev/228395a4378a75f9e630b989c346f153 
$ cd 228395a4378a75f9e630b989c346f153
$ git reset --hard && git status -s 
HEAD is now at fe1abe4 
 M "Prods\303\244ttning"
$ git reset --hard && git status -s 
HEAD is now at fe1abe4 
 M "Prodsa\314\210ttning"
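To spot such "fighting" names before they cause trouble, one rough approach is to group paths by a normalized key. This is a sketch under assumptions: find_collisions and collision_key are names invented here, and NFC followed by casefold() is only an approximation of what a given filesystem treats as the same name.

```python
import unicodedata
from collections import defaultdict

def collision_key(path):
    """Approximate 'same name' key: NFC-normalize, then case-fold."""
    return unicodedata.normalize("NFC", path).casefold()

def find_collisions(paths):
    """Return groups of paths that share a key (hypothetical helper)."""
    groups = defaultdict(list)
    for path in paths:
        groups[collision_key(path)].append(path)
    return [names for names in groups.values() if len(names) > 1]

paths = ["Prodsa\u0308ttning", "Prods\u00e4ttning", "Makefile", "makefile", "README"]
for group in find_collisions(paths):
    print([name.encode("utf-8") for name in group])
```

Printing the encoded bytes keeps the two ä spellings distinguishable, which a plain print of the strings would not.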

So, if you just want to remove the file, you can proceed as you like. If you want to reliably manipulate these files, look at setting the core.precomposeUnicode option to false, so that git will store exactly the filename bytes you tell it, but that is probably more trouble than it’s worth. I might suggest creating a commit that converts all the filenames to NFC so that git will not think a file is missing.
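As a starting point for such a cleanup commit, the sketch below lists tracked paths that are not yet in NFC. It assumes paths are UTF-8 and that git ls-files -z reports names exactly as stored in the index; plan_nfc_renames and tracked_paths are hypothetical helpers, and the renaming itself is deliberately left out because how best to do it depends on your platform and settings.

```python
import subprocess
import unicodedata

def plan_nfc_renames(paths):
    """Return (old, new) byte-string pairs for paths that are not already NFC."""
    renames = []
    for raw in paths:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not UTF-8; leave such paths alone
        composed = unicodedata.normalize("NFC", text).encode("utf-8")
        if composed != raw:
            renames.append((raw, composed))
    return renames

def tracked_paths():
    """List tracked paths as raw bytes; -z gives NUL-separated, unquoted output."""
    result = subprocess.run(["git", "ls-files", "-z"],
                            check=True, capture_output=True)
    return [p for p in result.stdout.split(b"\0") if p]

# Inside a repository you could inspect the plan like this:
#     for old, new in plan_nfc_renames(tracked_paths()):
#         print(old, "->", new)
```

Reviewing the printed pairs first seems safer than renaming blindly, since a decomposed and a composed spelling of the same name may already both be tracked.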

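There are some older answers to this question at "Git and the Umlaut problem on Mac OS X", but many of them predate git’s ability to normalize Unicode. Setting core.quotepath=false will only cause confusion in this case, because git would then print both spellings as the same-looking ä instead of the distinguishable \303\244 and a\314\210.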

Josh Lee answered Mar 15 '23 19:03
