
Where is the character encoding of a text file stored in Linux?

I know the short answer should be "nowhere", but something doesn't quite add up in Test 2 below.

Test 1. In Gedit, I create a new file containing only the string "aàbï". I choose "Save As", where there's a selector for the character encoding, and save it as "Unicode (UTF-8)". Then I repeat the same steps and save it to another file as "ISO-8859-15". The first file is 7 bytes in size (two 1-byte characters, two 2-byte characters and a LF at the end of the file, as a hex dump shows). The second file is 5 bytes in size (four 1-byte characters in the Latin encoding plus a LF). This shows that the encoding is not stored anywhere in the file. Apparently, when Gedit opens the file and decodes it correctly, it must be working out the encoding by analyzing the contents.
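The byte counts in Test 1 can be reproduced outside Gedit. Below is a minimal Python sketch using only the standard codecs. The helper name `guess_encoding` is hypothetical, and its try-UTF-8-first heuristic only illustrates what content-based detection could look like; it is not Gedit's actual algorithm.

```python
TEXT = "aàbï\n"  # the Test 1 string plus the trailing LF

# Encode the same text with the two encodings chosen in "Save As".
utf8_bytes = TEXT.encode("utf-8")
latin_bytes = TEXT.encode("iso-8859-15")

# Print the size and a hex dump of each byte sequence.
print(len(utf8_bytes), utf8_bytes.hex(" "))
print(len(latin_bytes), latin_bytes.hex(" "))


def guess_encoding(data, candidates=("utf-8", "iso-8859-15")):
    """Return the first candidate that decodes `data` without error.

    Hypothetical helper: UTF-8 is tried first because it rejects most
    byte sequences written in a single-byte Latin encoding, whereas
    ISO-8859-15 accepts any byte sequence at all.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(utf8_bytes))
print(guess_encoding(latin_bytes))
```

A guess like this only works because the file contains non-ASCII bytes; for pure ASCII content every candidate decodes successfully and the first one wins.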

Test 2. I do the same as above, but this time the contents of the file are just "abcd", that is, four ASCII characters. The two saved files have identical sizes (5 bytes) and identical hex dumps. The two files appear to be identical and indistinguishable, so, again, it seems no information about the encoding is included in the files.

However, when I open the two files of Test 2 again in Gedit and go to "Save As", the encoding that each file was saved with is preselected. Gedit can somehow tell that one file was encoded in UTF-8 and the other in ISO-8859-15, even though both contain only ASCII characters that produce the same byte sequence, so the files appear to be identical. How is that?

Is there some sort of metadata in the filesystem? Or is it just Gedit that has its own cache and remembers user choices for a given file that was already opened (and saved) with it on the same computer?

P.S. Note that this question is related to programming even though I pose a non-programming test case, because it is about how a given type of file is encoded, which affects how one would read, parse, decode, encode and write such files from a program.

asked Mar 27 '16 20:03 by matteo




1 Answer

It isn't, at least not by default. There's actually no difference between the way those two files containing "abcd" are stored in the filesystem, since UTF-8 and ISO-8859-15 both encode the ASCII range identically, so the string "abcd" produces the same bytes in either encoding.
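This is easy to check from a program. A short Python sketch (the variable names are illustrative only) encodes the Test 2 contents with each encoding and compares the resulting bytes:

```python
text = "abcd\n"  # the Test 2 contents plus the trailing LF

# Encode the same ASCII-only text with each encoding.
encodings = ("ascii", "utf-8", "iso-8859-15")
encoded = {name: text.encode(name) for name in encodings}

# Hex dump of each result, one line per encoding.
for name, data in encoded.items():
    print(name, data.hex(" "))

# True when every encoding produced exactly the same byte sequence.
all_identical = len(set(encoded.values())) == 1
print(all_identical)
```

Since the bytes on disk are the same, nothing in the file contents can distinguish the two saves.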

Ext filesystems do not record a file's character encoding. It is possible to attach a small amount of data (on the order of a few kilobytes) to a file on an ext filesystem by using extended attributes, but gedit apparently does not use them to store the character encoding. Instead, it caches the encoding that each user selected for specific files. You can demonstrate this by logging in as another user (I logged in as root for this experiment) and opening the same file: gedit will read it using the default encoding from the system locale, not the encoding you chose when saving it under the other login.
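For a program that does want to keep the encoding next to the file, extended attributes are one option. The following is a rough Python sketch, assuming Linux (where `os.setxattr` and `os.getxattr` are available) and a filesystem with user extended attributes enabled. The attribute name `user.charset` and the helper names are assumptions chosen for illustration; gedit does not read or write this attribute.

```python
import os
import tempfile

ATTR = "user.charset"  # assumed attribute name, not something gedit uses


def tag_encoding(path, encoding):
    """Try to record `encoding` as an extended attribute on `path`.

    Returns True on success, False if xattrs are unavailable on this
    platform or filesystem, or if the file cannot be accessed.
    """
    if not hasattr(os, "setxattr"):  # os.setxattr only exists on Linux
        return False
    try:
        os.setxattr(path, ATTR, encoding.encode("ascii"))
    except OSError:
        return False
    return True


def read_encoding_tag(path):
    """Return the recorded encoding name, or None if there is none."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode("ascii")
    except OSError:
        return None


def demo():
    """Tag a temporary file and read the tag back (None if unsupported)."""
    with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
        handle.write(b"abcd\n")
        handle.flush()
        tag_encoding(handle.name, "ISO-8859-15")
        return read_encoding_tag(handle.name)


print(demo())
```

Such a tag is lost whenever the file is copied or transferred by a tool that does not preserve extended attributes, so any reader still needs a fallback.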

answered Nov 05 '22 21:11 by sig_seg_v