
Git: Does diff not handle character encodings other than UTF-8?

I created a repo and added UTF-8 and Latin-2 encoded files with this content:

árvíztűrő tükörfúrógép
ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP

See it on GitHub: https://github.com/bimlas/git-test/commit/872370caf91f1faaf931c1228c797f3d10d6435d

The output of git log -p 82904e60 is:

commit 82904e60d1940c036c8190e2a41de6b423727a7c
Author: BimbaLaszlo <[email protected]>
Date:   Mon Jul 27 14:38:35 2015 +0200

    initial commit

diff --git a/fileencoding/latin2.txt b/fileencoding/latin2.txt
new file mode 100644
index 0000000..7165bc9
--- /dev/null
+++ b/fileencoding/latin2.txt
@@ -0,0 +1,2 @@
+<E1>rv<ED>zt<FB>r<F5> t<FC>k<F6>rf<FA>r<F3>g<E9>p^M
+<C1>RV<CD>ZT<DB>R<D5> T<DC>K<D6>RF<DA>R<D3>G<C9>P^M
diff --git a/fileencoding/utf8.txt b/fileencoding/utf8.txt
new file mode 100644
index 0000000..80e1878
--- /dev/null
+++ b/fileencoding/utf8.txt
@@ -0,0 +1,2 @@
+árvíztűrő tükörfúrógép^M
+ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP^M

I got the same output on Linux and Windows (where my locale is Latin2). I also tried without a pager (git --no-pager log -p 82904e60) and got the same result, only without the escape codes:

commit 82904e6
Author: BimbaLaszlo <[email protected]>
Date:   2015-07-27 14:38:35 +0200

    initial commit

diff --git a/fileencoding/latin2.txt b/fileencoding/latin2.txt
new file mode 100644
index 0000000..7165bc9
--- /dev/null
+++ b/fileencoding/latin2.txt
@@ -0,0 +1,2 @@
+�rv�zt�r� t�k�rf�r�g�p
+�RV�ZT�R� T�K�RF�R�G�P
diff --git a/fileencoding/utf8.txt b/fileencoding/utf8.txt
new file mode 100644
index 0000000..80e1878
--- /dev/null
+++ b/fileencoding/utf8.txt
@@ -0,0 +1,2 @@
+árvíztűrő tükörfúrógép
+ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP

The log of latin2.txt alone looks the same, so the problem is not caused by mixing differently encoded files in one output.

How can I set up Git to print the characters as they should appear, even without a pager?

EDIT

I think the problem is not related to the terminal. For example, in Windows PowerShell latin2.txt is displayed fine, but utf8.txt looks wrong:

[Screenshot: same encoding with different output]

asked by bimlas, Apr 08 '16 07:04


1 Answer

Git does not really care about the character encoding of file contents at all. To Git, a file is just a sequence of bytes, and the diff prints those bytes unchanged.

Displaying them is done by your terminal. If it is configured to decode output as UTF-8, your Latin-2 file looks broken. If it is configured to decode output as Latin-2, your UTF-8 file looks broken.
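To make that concrete, here is a minimal Python sketch (not anything Git itself runs) that encodes the question's first line both ways and then decodes each byte string with the "wrong" codec, which is roughly what a terminal does. The variable names are made up for illustration.

```python
# The same text, encoded two ways (as in utf8.txt and latin2.txt).
text = "árvíztűrő tükörfúrógép"
utf8_bytes = text.encode("utf-8")
latin2_bytes = text.encode("iso-8859-2")

# Git prints the stored bytes unchanged; the terminal chooses the decoder.
# A UTF-8 terminal showing the Latin-2 file: every accented byte is an
# invalid UTF-8 sequence and is shown as U+FFFD (the replacement character).
seen_on_utf8_terminal = latin2_bytes.decode("utf-8", errors="replace")

# A Latin-2 terminal showing the UTF-8 file: each two-byte UTF-8 sequence
# is shown as two unrelated Latin-2 characters.
seen_on_latin2_terminal = utf8_bytes.decode("iso-8859-2")

print(seen_on_utf8_terminal)
print(seen_on_latin2_terminal)

# Decoding with the matching codec recovers the text in both cases.
assert latin2_bytes.decode("iso-8859-2") == text
assert utf8_bytes.decode("utf-8") == text
```

The first printed line should look like the replacement-character output in the question's second log listing. The bytes in the repository are fine in both cases; only the decoder differs.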

Maybe the encoding attribute (see git help gitattributes) can give some tools a hint about how to decode a file correctly, but I have never used it. For example, GitHub might be smart enough to look at this attribute and decode those files differently.
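A different mechanism described in the same gitattributes documentation is a diff driver with a textconv command: a program that takes the name of a file as its single argument and writes a converted version to standard output, which Git then uses for the human-readable diff. The sketch below is only an assumption about how such a converter could look for Latin-2 files; the function name and the script are hypothetical, and it is not something this answer has tested against the repository in the question. It would be wired up through a `diff.<driver>.textconv` config entry plus a `diff=<driver>` attribute for the Latin-2 files.

```python
import sys


def latin2_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-2 (Latin-2) bytes as UTF-8."""
    return raw.decode("iso-8859-2").encode("utf-8")


def main(argv):
    # A textconv command receives the path of the file to convert as its
    # only argument and writes the converted text to standard output.
    with open(argv[1], "rb") as handle:
        sys.stdout.buffer.write(latin2_to_utf8(handle.read()))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

This only helps when the terminal expects UTF-8; on a Latin-2 terminal the conversion would have to go the other way for the UTF-8 files.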

answered by michas, Nov 09 '22 23:11