
Make git diff show UTF8 encoded characters properly

I have a file containing Swedish characters (åäö), encoded as UTF-8.

If I cat the file it displays fine, but if I do git diff the special characters are printed, for example, as <F6>.

Example git diff output:

-            name: 'Magler<F6>d, S<F6>der<E5>sen', 

What I wanted to see:

-            name: 'Magleröd, Söderåsen', 

I found another question about git and encoding problems: "git, msysgit, accents, utf-8, the definitive answers". It says all such problems should be fixed in git version 1.7.10. I have version 1.8.1.2.

What can I do to make git diff properly display åäö?

Tobbe asked Oct 17 '13 19:10


2 Answers

git is dumping out raw bytes. In this case, it doesn't care what your file's encoding is. The highlighted <F6> you're seeing is coming from less, which is presumably configured as your PAGER. Try setting:

LESSCHARSET=UTF-8 
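
A plain assignment like the one above only sets a shell variable; it has to be exported before a pager started by git can see it. The sketch below assumes a POSIX-like shell with less as the pager. It exports the variable and then runs the diff once with no pager at all, which helps tell a pager display problem apart from a problem with the bytes in the file itself. The git rev-parse guard is only there so the snippet does nothing outside a repository.

```shell
# Export the variable so child processes (git, and the pager it starts) inherit it.
export LESSCHARSET=UTF-8

# Bypass the pager entirely to see what git itself writes to the terminal.
# Guarded so that this is a no-op when not run inside a git work tree.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    git --no-pager diff
fi
```

If the unpaged output looks right, the pager was the culprit, and the export line can go in the shell's startup file (for example ~/.profile) to make it permanent. If the output is still wrong, the file's actual encoding is the more likely cause.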
Edward Thomson answered Sep 18 '22 17:09


@matt and @twalberg were correct: the file wasn't actually UTF-8 encoded. Figuring this out wasn't helped by the fact that my terminal (hterm) can't input åäö properly (although it can display and copy/paste them).

Converting the file from ISO-8859-1 to UTF-8 solved my issue:

iconv -f ISO-8859-1 -t UTF-8 in.txt > out_utf-8.txt
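
One way to confirm this kind of diagnosis before converting is to ask iconv whether the file is valid UTF-8 in the first place. The following is a minimal sketch, not the poster's exact procedure: the scratch directory, the sample content and the is_utf8 helper are assumptions made up for illustration. Note that a failed check only shows the file is not UTF-8; that the real encoding is ISO-8859-1 remains an assumption based on the bytes seen in the diff.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Hypothetical sample: "Söder" stored as ISO-8859-1, where ö is the single
# byte 0xF6 (octal 366), the same byte that less displayed as <F6>.
printf 'S\366der\n' > "$workdir/in.txt"

# iconv exits with a non-zero status when the input is not valid in the
# source encoding, so a UTF-8 to UTF-8 pass works as a validity check.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

if is_utf8 "$workdir/in.txt"; then
    echo "in.txt: valid UTF-8"
else
    echo "in.txt: not valid UTF-8"
fi

# Convert as in the answer, then check the result again.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/in.txt" > "$workdir/out_utf-8.txt"
is_utf8 "$workdir/out_utf-8.txt" && echo "out_utf-8.txt: valid UTF-8"

# Hex dump: ö should now be the two bytes c3 b6 instead of the single byte f6.
od -An -tx1 "$workdir/out_utf-8.txt"
```

Once the converted file replaces the original in the repository, git diff together with a UTF-8 aware pager should show åäö as expected.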

Tobbe answered Sep 21 '22 17:09