Where is the character encoding of a text file stored in Linux?

I know the short answer should be "nowhere", but something doesn't quite add up in Test 2 below.

Test 1. In Gedit, I create a new file containing only the string "aàbï", choose "Save As", and pick an encoding from the character encoding selector. I save it as "Unicode (UTF-8)", then repeat the process and save it to another file as "ISO-8859-15". The first file is 7 bytes in size (two 1-byte characters, two 2-byte characters and an LF at the end of the file, as a hex dump shows). The second file is 5 bytes in size (four 1-byte characters in the Latin encoding plus an LF). This shows that the encoding is not stored anywhere in the file. Apparently, when Gedit opens the file and decodes it correctly, it must be working out the encoding by analyzing the contents.
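The byte counts in Test 1 can be reproduced outside Gedit. The following is a minimal Python sketch, assuming the editor writes exactly the encoded text plus a trailing LF and no byte order mark; the variable names are only illustrative.

```python
# The test string from Test 1 plus the trailing LF, written with escapes
# so the example does not depend on this source file's own encoding.
text = "a\u00e0b\u00ef\n"  # "aàbï" followed by a line feed

utf8_bytes = text.encode("utf-8")
latin9_bytes = text.encode("iso-8859-15")

# In UTF-8, "à" and "ï" take two bytes each; in ISO-8859-15 one byte each.
print(len(utf8_bytes), utf8_bytes.hex(" "))      # 7 61 c3 a0 62 c3 af 0a
print(len(latin9_bytes), latin9_bytes.hex(" "))  # 5 61 e0 62 ef 0a
```

Neither byte sequence contains anything besides the encoded characters, which matches the hex dumps described above.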

Test 2. I do the same as above, but this time the contents of the file are just "abcd", that is, four ASCII characters. The two saved files have identical sizes (5 bytes) and identical hex dumps. The two files appear to be identical and indistinguishable, so, again, it seems that no information about the encoding is included in the files.

However, when I open the two files from Test 2 again in Gedit and go to "Save As", the encoding that each file was saved with is preselected. Gedit can somehow tell that one file was encoded in UTF-8 and the other in ISO-8859-15, even though both contain only ASCII characters, which produce the same byte sequence, and the files appear to be identical. How is that?

Is there some sort of metadata in the filesystem? Or is it just Gedit that has its own cache and remembers user choices for a given file that was already opened (and saved) with it on the same computer?

P.S. Note that this question is related to programming even though I pose a non-programming test case, because it is about how a given type of file is encoded, which affects how one would read, parse, decode, encode and write such files from a program.

237

asked Mar 27 '16 20:03

matteo

1 Answer

It isn't, at least not by default. There is no difference between the way those two files containing abcd are stored in the filesystem, since the string abcd consists entirely of ASCII characters, which UTF-8 and ISO-8859-15 encode to the same bytes.
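To illustrate why the contents alone cannot distinguish the two files, here is a small Python sketch. The `guess_encoding` helper is hypothetical and simply tries a list of candidate encodings in order; it is not gedit's actual detection logic, only an example of content-based guessing.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-15")):
    """Return the first candidate encoding that decodes data without error."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


ascii_text = "abcd\n"
as_utf8 = ascii_text.encode("utf-8")
as_latin9 = ascii_text.encode("iso-8859-15")

# Pure ASCII text yields the same bytes in both encodings, so nothing in
# the file content can reveal which one was chosen at save time.
files_identical = as_utf8 == as_latin9

# With non-ASCII characters the bytes differ, and a guess becomes possible.
accented_latin9 = "a\u00e0b\u00ef\n".encode("iso-8859-15")
accented_guess = guess_encoding(accented_latin9)
```

Here `files_identical` is `True`, while `accented_guess` is `"iso-8859-15"` because the bytes are neither valid ASCII nor valid UTF-8. Since ISO-8859-15 assigns a character to every byte value, a guess like this is only a heuristic: any byte sequence will decode under it.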

Ext filesystems do not record a file's character encoding as metadata. It is possible to store a limited amount of data (on the order of a few kilobytes) alongside a file on an ext filesystem by using extended attributes, but gedit apparently does not use them to store the character encoding. Instead, it caches each user's selected encoding for specific files. You can demonstrate this by logging in as another user (I logged in as root for this experiment) and opening the same file: gedit will read it using the default encoding for the system locale, not the custom encoding that you saved it with under the other login.
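For a program that does want to record the encoding next to the file, extended attributes can be set from Python on Linux. The sketch below is only an illustration under stated assumptions: the attribute name `user.charset` and both helper functions are choices made for this example, not something gedit is known to read, and the calls fail on filesystems or platforms without user extended attribute support.

```python
import os

# Attribute name chosen for this example; user-defined extended attributes
# on Linux must live in the "user." namespace.
ATTR_NAME = "user.charset"


def store_encoding(path, encoding):
    """Try to record the encoding as an extended attribute; return success."""
    try:
        os.setxattr(path, ATTR_NAME, encoding.encode("ascii"))
    except (OSError, AttributeError):
        # OSError: the filesystem does not support or allow user xattrs.
        # AttributeError: os.setxattr is unavailable on this platform.
        return False
    return True


def load_encoding(path):
    """Return the recorded encoding name, or None if it is not available."""
    try:
        return os.getxattr(path, ATTR_NAME).decode("ascii")
    except (OSError, AttributeError):
        return None
```

A caller could run `store_encoding("notes.txt", "ISO-8859-15")` after saving and fall back to content-based guessing whenever `load_encoding("notes.txt")` returns `None`. Note that such attributes are easily lost when a file is copied by tools that do not preserve them.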

153

answered Nov 05 '22 21:11

sig_seg_v

Tags: linux, encoding, unicode, utf-8