I have a question that may be quite naive, but I feel the need to ask, because I don't really know what is going on. I'm on Ubuntu.
Suppose I do
echo "t" > test.txt
and then run
file test.txt
I get
test.txt: ASCII text
If I then do
echo "å" > test.txt
and run file again, I get
test.txt: UTF-8 Unicode text
How does that happen? How does file "know" the encoding, or, alternatively, how does it guess it?
Thanks.
There are certain byte sequences that suggest UTF-8 encoding may be in use (see Wikipedia). If file finds one or more of those and doesn't find anything that can't occur in UTF-8, it's a fair guess that the file is encoded in UTF-8, but it is still only a guess. For the basic ASCII character set (ordinary characters like 't'), the byte representation is identical in most common encodings (including UTF-8), so if a file contains only basic ASCII characters, file has no way to tell which of the many ASCII-compatible encodings was intended. It simply reports ASCII by default.
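To make the guessing concrete, here is a minimal Python sketch of the same idea. It is not how file is actually implemented (the real tool also looks at control characters and several other encodings); the function name guess_text_encoding and the returned labels are made up for illustration.

```python
def guess_text_encoding(data: bytes) -> str:
    """Very rough guess: pure ASCII, valid UTF-8, or something else.

    Hypothetical helper for illustration only; the real `file` utility
    performs more checks than this.
    """
    if not data:
        return "empty"
    # Every byte below 0x80 means nothing distinguishes the file from ASCII.
    if all(byte < 0x80 for byte in data):
        return "ASCII text"
    # Bytes >= 0x80 are present: see whether they form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown 8-bit data"
    return "UTF-8 Unicode text"


print(guess_text_encoding(b"t\n"))
print(guess_text_encoding("\u00e5\n".encode("utf-8")))  # "\u00e5" is the letter å
print(guess_text_encoding(b"\xe5\n"))  # å as a single ISO-8859-1 byte
```

The last call shows why this is a heuristic: the lone byte 0xE5 is not valid UTF-8, so the sketch can only say that it is some other 8-bit encoding, not which one.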
The other thing to note is that your terminal and locale are set to use UTF-8, which is why the file gets written in UTF-8 in the first place. If they were set to a different encoding, such as ISO-8859-1, then the command
echo "å" > test.txt
would write the file in that encoding instead.
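To see how the choice of encoding changes the bytes that end up in the file, here is a small Python sketch. It only demonstrates the encodings themselves, not the shell's behaviour; the helper name encoded_hex is hypothetical.

```python
def encoded_hex(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))


line = "\u00e5\n"  # the letter å followed by the newline that echo appends

for name in ("utf-8", "iso-8859-1", "utf-16-le"):
    print(f"{name:12} {encoded_hex(line, name)}")
# utf-8        c3 a5 0a
# iso-8859-1   e5 0a
# utf-16-le    e5 00 0a 00
```

The same character becomes two bytes in UTF-8, one byte in ISO-8859-1, and two bytes in UTF-16, where even the newline takes two bytes. These differences in the byte patterns are what file has to work from.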
From the file manpage:
If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as "text" because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only "character data" because, while they contain text, it is text that will require translation before it can be read. In addition, file will attempt to determine other characteristics of text-type files. If the lines of a file are terminated by CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported. Files that contain embedded escape sequences or overstriking will also be identified.
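The line-terminator check mentioned at the end of that passage can be sketched in a similar way. This is only an illustration of the idea, assuming single-byte text and ignoring NEL; the function name line_terminators is made up and does not come from file itself.

```python
def line_terminators(data: bytes) -> set:
    """Return which of CRLF, CR and LF occur as line endings in `data`."""
    found = set()
    crlf = data.count(b"\r\n")
    if crlf:
        found.add("CRLF")
    # A CR that is not part of a CRLF pair is an old Mac-style terminator.
    if data.count(b"\r") > crlf:
        found.add("CR")
    # An LF that is not part of a CRLF pair is the Unix-standard terminator.
    if data.count(b"\n") > crlf:
        found.add("LF")
    return found


print(line_terminators(b"one\r\ntwo\r\n"))
print(line_terminators(b"one\ntwo\n"))
```

A file for which this returns only LF would be reported without comment, while CRLF or CR endings are the kind of extra characteristic the manpage says file will mention.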