
Why doesn't Git natively support UTF-16?

Tags: git, utf-16

Git supports several different encoding schemes: UTF-7, UTF-8, and UTF-32, as well as non-UTF ones.

Given this, why doesn't it support UTF-16?

There are a lot of questions that ask how to get Git to support UTF-16, but I don't think this has been explicitly asked or answered yet.

Asked Sep 24 '18 by Zac Faragher

2 Answers

I devote a significant chunk of a full chapter of my (currently rather moribund) book (see Chapter 3, which is in better shape than later chapters) to the issue of character encoding, because it is a historical mess. It's worth mentioning here, though, that part of the premise of this question—that Git supports UTF-7 and UTF-32 in some way—is wrong: UTF-7 is a proposal that never became a real standard and should probably never be used at all (so naturally, older Internet Explorer versions do use it, which leads to the security issue mentioned on the linked Wikipedia page).

That said, let's first separate character encoding from code pages. (See footnote-ish section below as well.) The fundamental problem here is that computers—well, modern ones anyway—work with a series of 8-bit bytes, with each byte representing an integer in the range [0..255]. Older systems had 6, 7, 8, and even 9-bit bytes, though I think calling anything less than 8 bits a "byte" is misleading. (BBN's "C machines" had 10-bit bytes!) In any case, if one byte represents one character-symbol, this gives us an upper limit of 256 kinds of symbols. In those bad old days of ASCII, that was sufficient, since ASCII had just 128 symbols, 33 of them being non-printing symbols (control codes 0x00 through 0x1f, plus 0x7f representing DEL or a deleted punch on paper tape, writing them in hexadecimal here).

When we needed more than 94 printable symbols plus the space (0x20), we—by we I mean people using computers all over the world, not specifically me—said: Well, look at this, we have 128 unused encodings, 0x80 through 0xff, let's use some of those! So the French used some for ç and é and so on, and punctuation like « and ». The Czechs needed one for Z-with-caron, ž. The Russians needed lots, for Cyrillic. The Greeks needed lots, and so on. The result was that the upper half of the 8-bit space exploded into many incompatible sets, which people called code pages.

Essentially, the computer stores some eight-bit byte value, such as 235 decimal (0xEB hex), and it's up to something else—another computer program, or ultimately a human staring at a screen—to interpret that 235 as, say, a Cyrillic л character, or a Greek λ, or whatever. The code page, if we are using one, tells us what "235" means: what sort of semantics we should impose on this.

The problem here is that there is a limit on how many character codes we can support. If we want to have the Cyrillic L (л) coexist with the Greek L (lambda, λ), we can't use both CP-1251 and CP-1253 at the same time, so we need a better way to encode the symbol. One obvious way is to stop using one-byte values to encode symbols: if we use two-byte values, we can encode 65536 values, 0x0000 through 0xffff inclusive; subtract a few for control codes and there is still room for many alphabets. However, we rapidly blew through even this limit, so we went to Unicode, which has room for 1,114,112 of what it calls code points, each of which represents some sort of symbol with some sort of semantic meaning. Somewhat over 100,000 of these are now in use, including Emoji like 😀 and 😱.
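
As a tiny illustration of that ambiguity, here is what the single byte 0xEB turns into under a few different code pages (a Python sketch, using Python's bundled codec names; nothing Git-specific):

    # The same byte value means different symbols under different code pages.
    raw = bytes([0xEB])           # one byte, 235 decimal / 0xEB hex

    print(raw.decode("cp1251"))   # 'л'  Cyrillic small el   (Windows-1251)
    print(raw.decode("cp1253"))   # 'λ'  Greek small lambda  (Windows-1253)
    print(raw.decode("latin-1"))  # 'ë'  e with diaeresis    (ISO-8859-1)

The byte itself carries no clue about which reading was intended; that context has to travel separately, which is exactly the problem Unicode set out to solve.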

Encoding Unicode into bytes or words

This is where UTF-8, UTF-16, UTF-32, UCS-2, and UCS-4 all come in. These are all schemes for encoding Unicode code points—one of those ~1 million values—into byte-streams. I'm going to skip over the UCS ones entirely and look only at the UTF-8 and UTF-16 encodings, since those are the two that are currently the most interesting. (See also What are Unicode, UTF-8, and UTF-16?)

The UTF-8 encoding is straightforward: any code point whose decimal value is less than 128 is encoded as a byte containing that value. This means that ordinary ASCII text characters remain ordinary ASCII text characters. Code points in 0x0080 (128 decimal) through 0x07ff (2047 decimal) encode into two bytes, both of whose values are in the 128-255 range and hence distinguishable from a one-byte encoded value. Code points in the 0x0800 through 0xffff range encode into three bytes in that same 128-255 range, and the remaining valid values encode into four such bytes. The key here, as far as Git itself is concerned, is that no encoded value resembles an ASCII NUL (0x00) or slash (0x2f).

What this UTF-8 encoding does is to allow Git to pretend that text strings—and especially file names—are slash-separated name components whose ends are, or can be anyway, marked with ASCII NUL bytes. This is the encoding that Git uses in tree objects, so UTF-8 encoded tree objects just fit, with no fiddling required.
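
A quick Python sketch (illustrative only, not anything Git runs) shows both the one-to-four-byte tiers described above and the property Git actually relies on, namely that an encoded name never produces a stray NUL or slash byte:

    # UTF-8 byte counts for code points from each tier described above.
    for ch in ("A", "é", "€", "😀"):        # U+0041, U+00E9, U+20AC, U+1F600
        print(ch, len(ch.encode("utf-8")))  # 1, 2, 3, 4 bytes respectively

    # No byte of a UTF-8 encoded name is 0x00 (NUL) or 0x2F ('/') unless the
    # name really contains a slash, so tree entries need no escaping.
    name = "résumé-😀.txt".encode("utf-8")
    assert 0x00 not in name and 0x2F not in name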

UTF-16 encoding uses two paired bytes per character. This creates two problems for Git and pathnames. First, a byte within a pair might accidentally resemble /, and every ASCII-valued character necessarily encodes as a pair of bytes in which one byte is 0x00, which resembles ASCII NUL. So Git would need to know that this particular path name has been encoded in UTF-16, and work on byte pairs. There's no room in a tree object for this information, so Git would need a new object type. Second, whenever we break a 16-bit value into two separate 8-bit bytes, we do so in some order: either the more significant byte comes first, then the less significant one, or the less significant byte comes first, then the more significant one. This second problem is the reason UTF-16 has byte order marks. UTF-8 needs no byte order mark and suffices, so why not use that in trees? So Git does.
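
Here is the same kind of Python sketch for UTF-16, showing both problems at once: every ASCII character drags in a 0x00 byte, and the two byte orders produce entirely different streams (the hex output assumes Python 3.8 or later for bytes.hex with a separator):

    # ASCII characters become byte pairs containing 0x00 in UTF-16,
    # and little-endian vs big-endian give different byte streams.
    path = "dir/file.c"
    print(path.encode("utf-16-le").hex(" "))
    # 64 00 69 00 72 00 2f 00 66 00 69 00 6c 00 65 00 2e 00 63 00
    print(path.encode("utf-16-be").hex(" "))
    # 00 64 00 69 00 72 00 2f 00 66 00 69 00 6c 00 65 00 2e 00 63

A tree reader that splits names on 0x2f and stops at 0x00 would mangle either of those byte streams.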

That's fine for trees, but we also have commits, tags, and blobs

Git does its own interpretation of three of these four kinds of objects:

  1. Commits contain hash IDs.
  2. Trees contain path names, file modes, and hash IDs.
  3. Tags contain hash IDs.

The one that's not listed here is the blob, and for the most part, Git does not do any interpretation of blobs.

To make it easy to understand the commits, trees, and tags, Git constrains all three to be in UTF-8 for the most part. However, Git does allow the log message in a commit, or the tag text in a tag, to go somewhat (mostly) uninterpreted. These come after the header that Git interprets, so even if there is something particularly tricky or ugly at this point, that's pretty safe. (There are some minor risks here since PGP signatures, which appear below the headers, do get interpreted.) For commits in particular, modern Git will include an encoding header line in the interpreted section, and Git can then attempt to decode the commit message body, and re-encode it into whatever encoding is used by whatever program is interpreting the bytes that Git spits out.1
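
As a rough sketch of how that header-plus-body layout can be consumed, here is a toy Python function (my own hypothetical helper, not Git's code) that picks up the optional encoding header and decodes the message with it, falling back to UTF-8 as Git does:

    # Toy parser: split a raw commit object into headers and message,
    # then decode the message using the optional "encoding" header.
    def decode_commit_message(raw: bytes) -> str:
        headers, _, body = raw.partition(b"\n\n")
        encoding = "utf-8"                    # Git's default when no header is present
        for line in headers.split(b"\n"):
            if line.startswith(b"encoding "):
                encoding = line[len(b"encoding "):].decode("ascii")
        return body.decode(encoding, errors="replace")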

The same rules could work for annotated tag objects. I'm not sure if Git has code to do that for tags (the commit code could mostly be re-used, but tags much more commonly have PGP signatures, and it's probably wiser just to force UTF-8 here). Since trees are internal objects, their encoding is largely invisible anyway—you do not need to be aware of this (except for the issues that I point out in my book).

This leaves blobs, which are the big gorilla.


1This is a recurring theme in the computing world: everything is repeatedly encoded and decoded. Consider how something arrives over Wi-Fi or a cable network connection: it's been encoded into some sort of radio wave or similar, and then some hardware decodes that into a bit-stream, which some other hardware re-encodes into a byte stream. Hardware and/or software strip off headers, interpret the remaining encoding in some way, change the data appropriately, and re-encode the bits and bytes, for another layer of hardware and software to deal with. It's a wonder anything ever gets done.


Blob encoding

Git likes to claim that it's entirely agnostic to the actual data stored in your files, as Git blobs. This is even mostly true. Or, well, half true. Or something. As long as all Git is doing is storing your data, it's completely true! Git just stores bytes. What those bytes mean is up to you.

This story falls apart when you run git diff or git merge, because the diff algorithms, and hence the merge code, are line-oriented. Lines are terminated with newlines. (If you're on a system that uses CRLF instead of newline, well, the second character of a CRLF pair is a newline, so there's no problem here—and Git is OK with an unterminated final line, though this causes some minor bits of heartburn here and there.) If the file is encoded in UTF-16, a lot of bytes tend to appear to be ASCII NULs, so Git just treats it as binary.

This is fixable: Git could decode the UTF-16 data into UTF-8, feed that data through all of its existing line-oriented algorithms (which would now see newline-terminated lines), and then re-encode the data back to UTF-16. There are a bunch of minor technical issues here; the biggest is deciding that some file is UTF-16, and if so, which endianness (UTF-16-LE, or UTF-16-BE?). If the file has a byte order marker, that takes care of the endian issue, and UTF-16-ness could be coded into .gitattributes just as you can currently declare files binary or text, so it's all solvable. It's just messy, and no one has done this work yet.
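
To make that concrete, here is a sketch of the round trip in Python, using the BOM (when there is one) to settle the endianness question; the helper names are mine, and this is nowhere near what Git's C code would actually look like:

    import codecs

    # Guess UTF-16 endianness from the BOM, decode to text for the
    # line-oriented machinery, then re-encode on the way back out.
    def utf16_to_text(data: bytes) -> tuple[str, str]:
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le"), "utf-16-le"
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be"), "utf-16-be"
        raise ValueError("no BOM: endianness must come from somewhere else, "
                         "e.g. a .gitattributes declaration")

    def text_to_utf16(text: str, enc: str) -> bytes:
        bom = codecs.BOM_UTF16_LE if enc == "utf-16-le" else codecs.BOM_UTF16_BE
        return bom + text.encode(enc)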

Footnote-ish: code pages can be considered a (crappy) form of encoding

I mentioned above that the thing we do with Unicode is to encode a 21-bit code point value in some number of eight-bit bytes (1 to 4 bytes in UTF-8, 2 bytes in UTF-16—there's an ugly little trick with what UTF-16 calls surrogates to squeeze 21 bits of value into 16 bits of container, occasionally using pairs of 16-bit values, here). This encoding trick means we can represent all legal 21-bit code point values, though we may need multiple 8-bit bytes to do so.
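
To see that surrogate trick in action, here is what an emoji looks like in both encodings (Python, purely illustrative):

    # U+1F600 lies above U+FFFF, so UTF-16 splits it into a surrogate pair.
    emoji = "😀"                               # code point U+1F600
    print(emoji.encode("utf-16-be").hex(" "))  # d8 3d de 00  (pair D83D, DE00)
    print(emoji.encode("utf-8").hex(" "))      # f0 9f 98 80  (four bytes)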

When we use a code page (CP-number), what we're doing is, or at least can be viewed as, mapping 256 values—those that fit into one 8-bit byte—into that 21-bit code point space. We pick out some subset of no more than 256 such code points and say: These are the code points we'll allow. We encode the first one as, say, 0xa0, the second as 0xa1, and so on. We always leave room for at least a few control codes—usually all 32 in the 0x00 through 0x1f range—and usually we leave the entire 7-bit ASCII subset, as Unicode itself does (see List of Unicode characters), which is why we most typically start at 0xa0.

When one writes proper Unicode support libraries, code pages simply become translation tables, using just this form of indexing. The hard part is making accurate tables for all the code pages, of which there are very many.
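
In code, such a table is nothing more than a 256-entry mapping from byte values to code points; here is a toy Python version that leans on Python's bundled code-page tables rather than writing them out by hand:

    # Build a translation table for one code page: byte value -> code point,
    # with None for byte values the code page leaves undefined.
    def code_page_table(codec: str) -> list:
        table = []
        for b in range(256):
            try:
                table.append(ord(bytes([b]).decode(codec)))
            except UnicodeDecodeError:
                table.append(None)
        return table

    cp1253 = code_page_table("cp1253")
    print(hex(cp1253[0xEB]))   # 0x3bb, GREEK SMALL LETTER LAMDA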

The nice thing about code pages is that characters are once again one-byte-each. The bad thing is that you choose your symbol set once, when you say: I use this code page. From then on, you are locked into this small subset of Unicode. If you switch to another code page, some or all of your eight-bit byte values represent different symbols.

Answered by torek


Git has recently begun to understand encodings such as UTF-16. See the gitattributes documentation and search for working-tree-encoding.

If you want .txt files to be UTF-16 without a BOM on a Windows machine, add this to your .gitattributes file:

*.txt text working-tree-encoding=UTF-16LE eol=CRLF

In response to jthill's comments:

There isn't any doubt that UTF-16 is a mess. However, consider:

  • Java uses UTF-16

  • As does Microsoft

    Note the line UTF16… the one used for native Unicode encoding on Windows operating systems

  • JavaScript uses a mess between UCS-2 and UTF-16

Answered by Rusi