
Different UTF-8 encoding in filenames on OS X

I have a small shell script in .x:

$ cat .x
u="Böhmáí"
touch "$u"
ls > .list
echo "$u" >.text

cat .list .text
diff .list .text
od -bc .list
od -bc .text

When I run this script with sh -x .x (-x only to show the commands), I get:

$ sh -x .x
+ u=Böhmáí
+ touch Böhmáí
+ ls
+ echo Böhmáí
+ cat .list .text
Böhmáí
Böhmáí
+ diff .list .text
1c1
< Böhmáí
---
> Böhmáí
+ od -bc .list
0000000   102 157 314 210 150 155 141 314 201 151 314 201 012            
           B   o   ̈    **   h   m   a   ́    **   i   ́    **  \n            
0000015
+ od -bc .text
0000000   102 303 266 150 155 303 241 303 255 012                        
           B   ö  **   h   m   á  **   í  **  \n                        
0000012

The same string Böhmáí is encoded into different bytes in the filename than in the content of a file. In the terminal (UTF-8 encoded) the string looks the same in both variants.

Where is the rabbit?

jm666 asked May 27 '11 14:05



1 Answer

(This is mostly stolen from a previous answer of mine...)

Unicode allows some accented characters to be represented in several different ways: as a single "code point" representing the accented character, or as a series of code points representing the unaccented version of the character followed by the accent(s). For example, "ä" could be represented either precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
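
To see the two forms side by side, here is a minimal sketch that builds both byte sequences with printf octal escapes and dumps them as hex. The helper name hexbytes is made up for this example; the octal values are the UTF-8 bytes quoted above.

```shell
# Hypothetical helper (not part of the original script):
# print the bytes of its argument as space-separated hex.
hexbytes() {
    printf '%s' "$1" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Precomposed: U+00E4, UTF-8 bytes c3 a4 (octal 303 244).
precomposed=$(printf '\303\244')
# Decomposed: U+0061 U+0308, UTF-8 bytes 61 cc 88 (octal 141 314 210).
decomposed=$(printf 'a\314\210')

hexbytes "$precomposed"; echo
hexbytes "$decomposed"; echo
```

Both strings render as "ä" in a UTF-8 terminal, but the first is two bytes and the second is three.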

OS X's HFS+ filesystem requires that all filenames be stored in fully decomposed form; through the shell and the POSIX APIs you see this as decomposed UTF-8. In an HFS+ filename, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.

So what's happening here is that your shell script contains "Böhmáí" in precomposed form, so it gets stored that way in the variable u, and stored that way in the .text file. But when you create a file with that name (with touch), the filesystem converts it to the decomposed form for the actual filename. And when you ls it, it shows the form the filesystem has: the decomposed form.
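
The mismatch can be reproduced without an HFS+ volume. The sketch below rebuilds both strings from the octal bytes in the question's od output and compares them; the names nfc, nfd and bytelen are made up for this example.

```shell
# Hypothetical helper: byte length of a string (tr strips wc's padding).
bytelen() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Precomposed form, as written in the script (bytes from `od -bc .text`).
nfc=$(printf 'B\303\266hm\303\241\303\255')
# Decomposed form, as listed by ls (bytes from `od -bc .list`).
nfd=$(printf 'Bo\314\210hma\314\201i\314\201')

# The shell compares the strings byte by byte, so they are not equal.
if [ "$nfc" = "$nfd" ]; then
    echo "same bytes"
else
    echo "different bytes: $(bytelen "$nfc") vs $(bytelen "$nfd")"
fi
```

To make diff treat the two as equal, one side has to be normalized first. On OS X, `iconv -f UTF-8-MAC -t UTF-8` is a commonly cited way to turn the decomposed filename form back into the precomposed one; treat that encoding name as an assumption about your system's iconv, since it is not available everywhere.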

Gordon Davisson answered Oct 22 '22 13:10