
encoding of file shell script

How can I check the file encoding in a shell script? I need to know if a file is encoded in utf-8 or iso-8859-1.

Thanks

rizidoro asked Nov 13 '09 17:11



4 Answers

I'd just use

file -bi myfile.txt

to determine the character encoding of a particular file.

This relies on an external dependency, but I suspect file is available on practically every semi-modern distro.

EDIT:

In response to Laurence Gonsalves' comment: -b is the 'brief' option (it omits the filename) and -i is shorthand for --mime, so the most portable form (including Mac OS X) is probably:

file --mime myfile.txt 
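
If the result has to be consumed by a script, the charset needs to be pulled out of the MIME string that file prints. Below is a minimal Python sketch of that step, assuming the usual "text/plain; charset=utf-8" output shape; the helper names parse_charset and charset_via_file are made up for illustration and are not part of any tool.

```python
import subprocess
from typing import Optional


def parse_charset(mime_output: str) -> Optional[str]:
    """Extract the charset value from a line such as 'text/plain; charset=utf-8'."""
    for part in mime_output.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key.lower() == "charset":
            return value.strip()
    return None  # no charset parameter present


def charset_via_file(path: str) -> Optional[str]:
    """Run `file -b --mime` on path and return the reported charset (needs file installed)."""
    result = subprocess.run(
        ["file", "-b", "--mime", path],
        capture_output=True, text=True, check=True,
    )
    return parse_charset(result.stdout)
```

The parser also copes with the non-brief form, where the filename precedes the MIME type, because it only looks at the parameter after the semicolon.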
ChristopheD answered Oct 06 '22 07:10


There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).

Most tools that attempt to make this distinction will try and decode the file as utf-8 (as that's the more strict encoding), and if that fails, then fall back to iso-8859-1. You can do this with iconv "by hand", or you can use file:

$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text

Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.

$ file ascii.txt
ascii.txt: ASCII text
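
The "try UTF-8 first, then fall back" strategy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not what file actually does internally, and guess_encoding is a hypothetical name. Note that the ISO-8859-1 fallback can never fail, since every byte value is valid in that encoding, which is exactly why the result is a guess.

```python
def guess_encoding(data: bytes) -> str:
    """Guess among ascii, utf-8 and iso-8859-1 by attempting strict decodes."""
    # Try the strictest encodings first: pure ASCII, then UTF-8.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # Every byte sequence is valid ISO-8859-1, so this is a fallback, not a proof.
    return "iso-8859-1"
```

For a file you would pass in its raw bytes, for example guess_encoding(open(path, "rb").read()).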

Finally: there's no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you're going to assume it's natural language and use statistical methods. This is probably why file says "ISO-8859".
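
To see why the ISO-8859 variants cannot be told apart from the bytes alone, here is a small Python sketch (the function name interpretations is just for illustration). The same single byte decodes without error in both encodings and simply yields a different character in each.

```python
def interpretations(data: bytes, encodings=("iso-8859-1", "iso-8859-2")):
    """Decode the same bytes under each encoding and return the results by name."""
    return {name: data.decode(name) for name in encodings}


# Byte 0xF5 is 'õ' in ISO-8859-1 and 'ő' in ISO-8859-2; neither decode fails.
sample = b"\xf5"
print(interpretations(sample))
```

Both decodes succeed, so only knowledge of the language (or statistics over the text) can tell you which reading was intended.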

Laurence Gonsalves answered Oct 06 '22 06:10


You can use the file command: file --mime myfile.text

jochil answered Oct 06 '22 07:10


The file command is not 100% reliable either. A simple test:

#!/bin/bash

echo "a" > /tmp/foo

for i in {1..1000000}
do
  echo "asdas" >> /tmp/foo
done

echo "üöäÄÜÖß " >> /tmp/foo

file -b --mime-encoding /tmp/foo

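This outputs:

us-ascii

ASCII has no German umlauts, so that answer is wrong. The likely reason is that file inspects only a limited amount of data from the start of the file and never reaches the umlauts appended at the end.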

A file is just a sequence of bytes. Without trusting metadata (a BOM, which is only recommended for UTF-16 and UTF-32, a MIME type, or a header inside the data) you can't really detect the encoding. The same sequence of bytes can be interpreted as UTF-8, ISO-8859-1/2 or anything else you want; whether that works depends on whether the chosen encoding has a mapping for that particular sequence. What you can do is try to decode the whole file content with the desired character encoding. If that fails, the encoding has no mapping for this sequence of bytes.

From a shell script you could call python, perl or, as Laurence Gonsalves says, iconv. For text files I use this in Python:

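import codecs

# Opening never fails on bad bytes; the strict decode happens as the file is
# read, and an invalid byte then raises UnicodeDecodeError.
f = codecs.open(path, encoding='utf-8', errors='strict')


def valid_string(data):
  # data must be bytes: in Python 3 only bytes objects have decode()
  try:
    data.decode('utf-8')
    return True
  except UnicodeDecodeError:
    return False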
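
To run the same strict-decode check over a whole file without loading it into memory, something like the following Python sketch could work. It assumes Python 3; the function name file_decodes_as and the chunk size are my own choices, not anything standard. Unlike file in the test above, it reads every byte, so umlauts at the very end are still caught.

```python
import codecs


def file_decodes_as(path, encoding="utf-8", chunk_size=64 * 1024):
    """Return True if every byte of the file decodes strictly in `encoding`."""
    # An incremental decoder keeps state between chunks, so a multi-byte
    # sequence split across a chunk boundary is still handled correctly.
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
            # final=True reports a multi-byte sequence truncated at end of file
            decoder.decode(b"", final=True)
        return True
    except UnicodeDecodeError:
        return False
```

From a shell script this could be wrapped in a small python3 call whose exit status reflects the result.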

How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (which means the file is in one of the UTF encodings).
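
The BOM check can be sketched in Python with the constants from the standard codecs module. The function name detect_bom and the returned labels are only illustrative. A missing BOM tells you nothing, so this is a hint to combine with the decode test, not a replacement for it.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(head: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

Only the first four bytes of the file are needed, for example detect_bom(open(path, "rb").read(4)).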

broadband answered Oct 06 '22 07:10