How can I check the file encoding in a shell script? I need to know if a file is encoded in utf-8 or iso-8859-1.
Thanks
To verify whether a file is valid in a given encoding (ASCII, ISO-8859-1, UTF-8 and so on), a good solution is the 'iconv' command: ask it to convert from that encoding and check whether it reports an error.
In Visual Studio, you can select "File > Advanced Save Options..." The "Encoding:" combo box will tell you specifically which encoding is currently being used for the file.
There are a few options you can use: check the content-type to see if it includes a charset parameter which would indicate the encoding (e.g. Content-Type: text/plain; charset=utf-16 ); check if the uploaded data has a BOM (the first few bytes in the file, which would map to the unicode character U+FEFF - 2 bytes for ...
I'd just use
file -bi myfile.txt
to determine the character encoding of a particular file.
This solution has an external dependency, but I suspect file
is available on all semi-modern distros nowadays.
EDIT:
In response to Laurence Gonsalves' comment: -b
is the option to be 'brief' (omit the filename from the output) and -i
is the shorthand equivalent of --mime,
so the most portable way (including Mac OS X) is probably:
file --mime myfile.txt
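If a script needs only the charset, the output of file --mime can be picked apart. This is a minimal sketch in Python, assuming the usual output shape "text/plain; charset=utf-8". The names parse_charset and file_charset are hypothetical, and file_charset assumes the file utility is on the PATH.

```python
import subprocess

def parse_charset(mime_line):
    """Extract the charset from a line such as 'text/plain; charset=utf-8'.

    Returns None when the line carries no charset parameter.
    """
    for part in mime_line.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key.strip().lower() == "charset":
            return value.strip().lower()
    return None

def file_charset(path):
    """Run the file utility on path and return the charset it reports.

    Assumption: 'file' is installed and accepts -b and --mime.
    """
    result = subprocess.run(
        ["file", "-b", "--mime", path],
        capture_output=True, text=True, check=True,
    )
    return parse_charset(result.stdout)
```

The parsing is kept separate from the subprocess call so it also works on output captured some other way, with or without the leading filename.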
There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).
Most tools that attempt to make this distinction will try to decode the file as utf-8 (as that's the stricter encoding), and if that fails, fall back to iso-8859-1. You can do this with iconv
"by hand", or you can use file
:
$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text
Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.
$ file ascii.txt
ascii.txt: ASCII text
Finally: there's no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you're going to assume it's natural language and use statistical methods. This is probably why file says "ISO-8859".
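The "try UTF-8, then fall back" approach described in this answer can be sketched in Python. The function name guess_encoding is hypothetical, and the final branch is only a guess: ISO-8859-1 assigns a character to every byte value, so it can never be ruled out.

```python
def guess_encoding(data):
    """Guess between ASCII, UTF-8 and ISO-8859-1 for a bytes object."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # UTF-8 is strict: arbitrary 8-bit text rarely happens to be valid.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence decodes as ISO-8859-1, so this is a fallback,
        # not a detection. It could just as well be ISO-8859-2.
        return "iso-8859-1"
```

For a file, the bytes could come from open(path, "rb").read().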
You can use the file command:
file --mime myfile.text
The file command is not 100% reliable. A simple test:
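#!/bin/bash
# Start with a large block of pure ASCII text...
echo "a" > /tmp/foo
for i in {1..1000000}
do
    echo "asdas" >> /tmp/foo
done
# ...and put the only non-ASCII characters on the very last line.
echo "üöäÄÜÖß " >> /tmp/foo
file -b --mime-encoding /tmp/foo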
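This outputs:
us-ascii
ASCII has no German umlauts, so the answer is wrong: file examines only a limited number of bytes from the start of the file and never reaches the last line.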
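The test above has file reporting us-ascii for a file whose last line is not ASCII. A check that reads every byte avoids that. Here is a minimal sketch in Python, which processes the file in chunks so that large files need not fit in memory. The function name classify_whole_file and the three labels it returns are illustrative choices, not output of any existing tool.

```python
import codecs

def classify_whole_file(path, chunk_size=64 * 1024):
    """Read all of path and return 'us-ascii', 'utf-8' or 'unknown-8bit'."""
    # The incremental decoder keeps state between chunks, so a multi-byte
    # character split across a chunk boundary is handled correctly.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    only_ascii = True
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            if only_ascii and not chunk.isascii():
                only_ascii = False
            try:
                decoder.decode(chunk)
            except UnicodeDecodeError:
                return "unknown-8bit"
    try:
        # final=True rejects a file that ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "us-ascii" if only_ascii else "utf-8"
```

Called on a file like /tmp/foo from the script above, this would reach the umlauts on the last line, at the cost of reading the whole file.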
A file is just a sequence of bytes. Without trusting metadata (a BOM, which is only recommended for utf-16 and utf-32; a MIME type; a header in the data) you can't really detect the encoding. A sequence of bytes can be interpreted as utf-8 or ISO-8859-1/2 or anything you want; whether that works depends on whether a mapping exists for that particular sequence in iso-8859-1 or utf-8. What you want is to decode the whole file content with the desired character encoding. If that fails, the desired encoding has no mapping for this sequence of bytes.
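As a small illustration of that point, the same two bytes decode without error under both encodings and give different text, while a single byte can be valid in one encoding and invalid in the other. This is only a sketch in Python; the variable names are made up for the example.

```python
data = b"\xc3\xbc"  # two bytes, no metadata attached

# As UTF-8 these bytes are one character, U+00FC (u with diaeresis).
as_utf8 = data.decode("utf-8")

# As ISO-8859-1 every byte is its own character: U+00C3 followed by U+00BC.
as_latin1 = data.decode("iso-8859-1")

# The reverse case: the ISO-8859-1 byte for U+00FC is not valid UTF-8.
try:
    b"\xfc".decode("utf-8")
    latin1_byte_is_utf8 = True
except UnicodeDecodeError:
    latin1_byte_is_utf8 = False
```

Both decodings of data succeed, so nothing in the bytes themselves says which one was intended.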
In a shell script you could use Python, Perl or, as Laurence Gonsalves says, iconv. For text files I use this in Python:
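import codecs

# 'path' is the file to check; with errors='strict', reading raises
# UnicodeDecodeError as soon as invalid UTF-8 is encountered.
f = codecs.open(path, encoding='utf-8', errors='strict')

def valid_string(data):
    # 'data' must be a byte string (str in Python 2, bytes in Python 3).
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False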
How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (in which case the file is in a Unicode encoding).
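The BOM check mentioned above might look like the following in Python. It is a sketch only: the function name detect_bom and the returned labels are hypothetical, and a missing BOM says nothing about the encoding.

```python
import codecs

# Longer BOMs are listed first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(path):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    with open(path, "rb") as handle:
        head = handle.read(4)  # the longest BOM is four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

One ambiguity remains: a UTF-16 LE file whose first character is U+0000 starts with the same four bytes as a UTF-32 LE BOM, so even this answer takes a little trust.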