I'm trying to transcode a bunch of files from US-ASCII to UTF-8.
For that, I'm using iconv:
iconv -f US-ASCII -t UTF-8 file.php > file-utf8.php
My original files are US-ASCII encoded, so the conversion doesn't seem to happen. Apparently this is because ASCII is a subset of UTF-8, as discussed in this question:
iconv US ASCII to UTF-8 or ISO-8859-15
And quoting:
There's no need for the textfile to appear otherwise until non-ASCII characters are introduced
True. If I introduce a non-ASCII character in the file and save it, let's say with Eclipse, the file encoding (charset) is switched to UTF-8.
In my case, I'd like to force iconv to transcode the files to UTF-8 anyway, whether there are non-ASCII characters in them or not.
Note: The reason is that my PHP code (in non-ASCII files...) deals with some non-ASCII strings, which causes the strings to be interpreted incorrectly (French):
Il était une fois... l'homme série animée mythique d'Albert
Barillé (Procidis), 1ère
...
US-ASCII is a subset of UTF-8 (see Ned's answer below).
US-ASCII is a 7-bit code, and it's a true subset of UTF-8. In other words, every ASCII file is by definition also a UTF-8 file. The file command classifies it as 'ASCII' because there are no 8-bit characters in it, and it's totally right in doing that.
UTF-8 is a Unicode character encoding. This means that UTF-8 takes the code point for a given Unicode character and translates it into a sequence of bytes. It also does the reverse, reading bytes and converting them back to characters.
The iconv command is used to convert text from one encoding to another. If no input file is provided, it reads from standard input. Similarly, if no output file is given, it writes to standard output. If no from-encoding or to-encoding is provided, it uses the current locale's character encoding.
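For example (a minimal sketch; input.txt and output.txt are placeholder names):

$ iconv -l | head                                        # list the encodings iconv knows about
$ iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt    # read input.txt, write the converted text to output.txt
$ echo 'hello' | iconv -t UTF-8                          # no -f given: the input is assumed to be in the locale's encoding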
ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.
It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
In short:

- file only guesses at the file encoding and may be wrong (especially in cases where special characters only appear late in large files).
- You can use hexdump to look at the bytes of non-7-bit-ASCII text and compare against code tables for common encodings (ISO 8859-*, UTF-8) to decide for yourself what the encoding is.
- iconv will use whatever input/output encoding you specify, regardless of what the contents of the file are. If you specify the wrong input encoding, the output will be garbled.
- Even after running iconv, file may not report any change due to the limited way in which file attempts to guess at the encoding. For a specific example, see my long answer below.

I ran into this today and came across your question. Perhaps I can add a little more information to help other people who run into this issue.
First, the term ASCII is overloaded, and that leads to confusion.
7-bit ASCII only includes 128 characters (00-7F in hexadecimal, or 0-127 in decimal). 7-bit ASCII is also sometimes referred to as US-ASCII.
ASCII
UTF-8 encoding uses the same encoding as 7-bit ASCII for its first 128 characters. So a text file that only contains characters from that range of the first 128 characters will be identical at a byte level whether encoded with UTF-8 or 7-bit ASCII.
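You can see this for yourself (a quick sketch using the same tools that appear later in this answer): "convert" a pure-ASCII string and confirm that the bytes come out unchanged.

$ printf 'hello' | hexdump -C
00000000  68 65 6c 6c 6f                                    |hello|
00000005
$ printf 'hello' | iconv -f US-ASCII -t UTF-8 | hexdump -C
00000000  68 65 6c 6c 6f                                    |hello|
00000005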
Codepage layout
The term extended ASCII (or high ASCII) refers to eight-bit or larger character encodings that include the standard seven-bit ASCII characters, plus additional characters.
Extended ASCII
ISO 8859-1 (aka "ISO Latin 1") is a specific 8-bit ASCII extension standard that covers most characters for Western Europe. There are other ISO standards for Eastern European languages and Cyrillic languages. ISO 8859-1 includes encoding for characters like Ö, é, ñ and ß for German and Spanish (UTF-8 supports these characters too, but the underlying encoding is different).
"Extension" means that ISO 8859-1 includes the 7-bit ASCII standard and adds characters to it by using the 8th bit. So for the first 128 characters, ISO 8859-1 is equivalent at a byte level to both ASCII and UTF-8 encoded files. However, when you start dealing with characters beyond the first 128, you are no longer UTF-8 equivalent at the byte level, and you must do a conversion if you want your "extended ASCII" encoded file to be UTF-8 encoded.
ISO 8859 and proprietary adaptations
file

One lesson I learned today is that we can't trust file to always give a correct interpretation of a file's character encoding.
file (command)
The command tells only what the file looks like, not what it is (in the case where file looks at the content). It is easy to fool the program by putting a magic number into a file the content of which does not match it. Thus the command is not usable as a security tool other than in specific situations.
file looks for magic numbers in the file that hint at the type, but these can be wrong; there is no guarantee of correctness. file also tries to guess the character encoding by looking at the bytes in the file. Basically, file has a series of tests that help it guess at the file type and encoding.
My file is a large CSV file. file reports this file as US-ASCII encoded, which is WRONG.
$ ls -lh
total 850832
-rw-r--r--  1 mattp  staff   415M Mar 14 16:38 source-file
$ file -b --mime-type source-file
text/plain
$ file -b --mime-encoding source-file
us-ascii
My file has umlauts in it (i.e., Ö). The first non-7-bit-ASCII character doesn't show up until over 100k lines into the file. I suspect this is why file doesn't realize the file encoding isn't US-ASCII.
$ pcregrep -no '[^\x00-\x7F]' source-file | head -n1
102321:�
I'm on a Mac, so I'm using PCRE's grep (pcregrep). With GNU grep you could use the -P option. Alternatively, on a Mac, one could install GNU grep (via Homebrew or otherwise).
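For example, with GNU grep something like this should locate the first non-ASCII byte (a sketch; LC_ALL=C makes the character class match raw bytes rather than multi-byte characters):

$ LC_ALL=C grep -nPo '[^\x00-\x7F]' source-file | head -n1    # prints the line number and the offending byte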
I haven't dug into the source code of file, and the man page doesn't discuss the text encoding detection in detail, but I am guessing file doesn't look at the whole file before guessing the encoding.
Whatever my file's encoding is, these non-7-bit-ASCII characters break stuff. My German CSV file is ;-separated, and extracting a single column doesn't work.
$ cut -d";" -f1 source-file > tmp cut: stdin: Illegal byte sequence $ wc -l * 3081673 source-file 102320 tmp 3183993 total
Note the cut error and that my "tmp" file has only 102320 lines, with the first special character on line 102321.
Let's take a look at how these non-ASCII characters are encoded. I dump the first non-7-bit-ASCII character into hexdump, do a little formatting, remove the newlines (0a) and take just the first few bytes.
$ pcregrep -o '[^\x00-\x7F]' source-file | head -n1 | hexdump -v -e '1/1 "%02x\n"'
d6
0a
Another way: I know the first non-7-bit-ASCII char is at position 85 on line 102321. I grab that line and tell hexdump to take the two bytes starting at position 85. You can see the special (non-7-bit-ASCII) character represented by a ".", and the next byte is "M"... so this is a single-byte character encoding.
$ tail -n +102321 source-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057
In both cases, we see the special character is represented by d6. Since this character is an Ö, which is a German letter, I am guessing that ISO 8859-1 should include it. Sure enough, you can see that "d6" is a match in the code table (ISO/IEC 8859-1).
Important question... how do I know this character is an Ö without being sure of the file encoding? The answer is context: I opened the file, read the text and then determined what character it is supposed to be. If I open it in Vim it displays as an Ö, because Vim does a better job of guessing the character encoding (in this case) than file does.
So, my file seems to be ISO 8859-1. In theory I should check the rest of the non-7-bit-ASCII characters to make sure ISO 8859-1 is a good fit... There is nothing that forces a program to only use a single encoding when writing a file to disk (other than good manners).
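If you did want to do that check, one way (a sketch built from the same tools used above) is to dump every non-ASCII byte, count the distinct values, and look each one up in the ISO 8859-1 code table:

$ pcregrep -o '[^\x00-\x7F]' source-file | hexdump -v -e '1/1 "%02x\n"' | grep -v '^0a$' | sort | uniq -c | sort -rn

Each output line is a count followed by a hex byte value (the grep -v '^0a$' filters out the newlines that pcregrep adds after each match).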
I'll skip that check and move on to the conversion step.
$ iconv -f iso-8859-1 -t utf8 source-file > output-file
$ file -b --mime-encoding output-file
us-ascii
Hmm. file still tells me this file is US-ASCII even after conversion. Let's check with hexdump again.
$ tail -n +102321 output-file | head -n1 | hexdump -C -s85 -n2
00000055  c3 96                                             |..|
00000057
Definitely a change. Note that we have two bytes of non-7-bit-ASCII (represented by the "." on the right) and the hex code for the two bytes is now c3 96. If we take a look, it seems we have UTF-8 now (c3 96 is the encoding of Ö in UTF-8). See the UTF-8 encoding table and Unicode characters.
But file still reports our file as us-ascii? Well, I think this goes back to the point about file not looking at the whole file and the fact that the first non-7-bit-ASCII characters don't occur until late in the file.
I'll use sed to stick an Ö at the beginning of the file and see what happens.
$ sed '1s/^/Ö\'$'\n/' source-file > test-file
$ head -n1 test-file
Ö
$ head -n1 test-file | hexdump -C
00000000  c3 96 0a                                          |...|
00000003
Cool, we have an umlaut. Note, though, that the encoding is c3 96 (UTF-8). Hmm.
Checking our other umlauts in the same file again:
$ tail -n +102322 test-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057
ISO 8859-1. Oops! It just goes to show how easy it is to get the encodings screwed up. To be clear, I've managed to create a mix of UTF-8 and ISO 8859-1 encodings in the same file.
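A quick way to confirm that a file contains byte sequences that are not valid UTF-8 (a sketch that simply uses iconv as a validator) is to attempt a UTF-8-to-UTF-8 conversion and check whether it fails:

$ iconv -f UTF-8 -t UTF-8 test-file > /dev/null && echo "valid UTF-8" || echo "contains invalid UTF-8 byte sequences"

For the mangled test-file above this fails, because a lone d6 byte followed by an ASCII character is not a valid UTF-8 sequence.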
Let's try converting our mangled (mixed encoding) test file with the umlaut (Ö) at the front and see what happens.
$ iconv -f iso-8859-1 -t utf8 test-file > test-file-converted
$ head -n1 test-file-converted | hexdump -C
00000000  c3 83 c2 96 0a                                    |.....|
00000005
$ tail -n +102322 test-file-converted | head -n1 | hexdump -C -s85 -n2
00000055  c3 96                                             |..|
00000057
The first umlaut, which was UTF-8, was interpreted as ISO 8859-1 since that is what we told iconv... not what we want, but that is what we told iconv to do. The second umlaut is correctly converted from d6 (ISO 8859-1) to c3 96 (UTF-8).
I'll try again, but this time I will use Vim to do the Ö insertion instead of sed. Vim seemed to detect the encoding better before (as "latin1", aka ISO 8859-1), so perhaps it will insert the new Ö with a consistent encoding.
$ vim source-file
$ head -n1 test-file-2
�
$ head -n1 test-file-2 | hexdump -C
00000000  d6 0d 0a                                          |...|
00000003
$ tail -n +102322 test-file-2 | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057
Indeed vim used the correct/consistent ISO encoding when inserting the character at the beginning of the file.
Now the test: Does file do a better job of recognizing the encoding with special characters at the beginning of the file?
$ file -b --mime-encoding test-file-2
iso-8859-1
$ iconv -f iso-8859-1 -t utf8 test-file-2 > test-file-2-converted
$ file -b --mime-encoding test-file-2-converted
utf-8
Yes, it does! Moral of the story: don't trust file to always guess your encoding right. It is easy to mix encodings within the same file. When in doubt, look at the hex.
A hack that would address this specific limitation of file when dealing with large files would be to shorten the file to make sure that special (non-ASCII) characters appear early in the file, so file is more likely to find them.
$ first_special=$(pcregrep -o1 -n '()[^\x00-\x7F]' source-file | head -n1 | cut -d":" -f1)
$ tail -n +$first_special source-file > /tmp/source-file-shorter
$ file -b --mime-encoding /tmp/source-file-shorter
iso-8859-1
You could then use the (presumably correct) detected encoding to feed as input to iconv to ensure you are converting correctly.
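Putting those steps together, something like this (a sketch continuing from the commands above) detects the encoding on the shortened copy and then converts the full original file:

$ enc=$(file -b --mime-encoding /tmp/source-file-shorter)    # e.g. iso-8859-1
$ iconv -f "$enc" -t UTF-8 source-file > source-file-utf8

This assumes file reports an encoding name that iconv also understands; if it reports something like binary, you will still have to investigate by hand.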
Christos Zoulas updated file to make the number of bytes looked at configurable. A one-day turnaround on the feature request. Awesome!

http://bugs.gw.com/view.php?id=533 (Allow altering how many bytes to read from analyzed files from the command line)
The feature was released in file version 5.26.
Looking at more of a large file before making a guess about encoding takes time. However, it is nice to have the option for specific use-cases where a better guess may outweigh additional time and I/O.
Use the following option:
-P, --parameter name=value
      Set various parameter limits.

      Name     Default    Explanation
      bytes    1048576    max number of bytes to read from file
Something like...
file_to_check="myfile"
bytes_to_scan=$(wc -c < $file_to_check)
file -b --mime-encoding -P bytes=$bytes_to_scan $file_to_check
...it should do the trick if you want to force file to look at the whole file before making a guess. Of course, this only works if you have file version 5.26 or newer.
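If you are not sure which version you have, file can tell you (the exact output format may vary by platform):

$ file --version    # prints something like "file-5.26" plus the path of the magic file in use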
Forcing file to display UTF-8 instead of US-ASCII

Some of the other answers seem to focus on trying to make file display UTF-8 even if the file only contains plain 7-bit ASCII. If you think this through, you should probably never want to do this.

- If the file command says a file is UTF-8, that implies that the file contains some characters with UTF-8-specific encoding. If that isn't really true, it could cause confusion or problems down the line.
- If file displayed UTF-8 when the file only contained 7-bit ASCII characters, this would be a bug in the file program.
- If a program parses the file command output before accepting a file as input, and it won't process the file unless it "sees" UTF-8... well, that is pretty bad design. I would argue this is a bug in that program.

If you absolutely must take a plain 7-bit ASCII file and convert it to UTF-8, simply insert a single non-7-bit-ASCII character into the file with UTF-8 encoding for that character and you are done. But I can't imagine a use case where you would need to do this. The easiest UTF-8 character to use for this is the Byte Order Mark (BOM), which is a special non-printing character that hints that the file is non-ASCII. This is probably the best choice because it should not visually impact the file contents, as it will generally be ignored.
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
This is key:
or the file contains only ASCII
So some tools on Windows have trouble reading UTF-8 files unless the BOM character is present. However, this does not affect files that contain only plain 7-bit ASCII. That is, this is not a reason for forcing plain 7-bit ASCII files to be UTF-8 by adding a BOM character.
Here is more discussion about potential pitfalls of using the BOM when not needed (it IS needed for actual UTF-8 files that are consumed by some Microsoft apps). https://stackoverflow.com/a/13398447/3616686
Nevertheless, if you still want to do it, I would be interested in hearing your use case. Here is how. In UTF-8 the BOM is represented by the hex sequence 0xEF,0xBB,0xBF, so we can easily add this character to the front of our plain 7-bit ASCII file. Note that we have not modified or converted the original 7-bit ASCII content at all; we have only added a single non-7-bit-ASCII character to the beginning of the file, so the file is no longer entirely composed of 7-bit ASCII characters.
$ printf '\xEF\xBB\xBF' > bom.txt    # put a UTF-8 BOM char in a new file
$ file bom.txt
bom.txt: UTF-8 Unicode text, with no line terminators
$ file plain-ascii.txt               # our pure 7-bit ascii file
plain-ascii.txt: ASCII text
$ cat bom.txt plain-ascii.txt > plain-ascii-with-utf8-bom.txt    # put them together into one new file with the BOM first
$ file plain-ascii-with-utf8-bom.txt
plain-ascii-with-utf8-bom.txt: UTF-8 Unicode (with BOM) text