How can I check the file encoding in a shell script? I need to know if a file is encoded in utf-8 or iso-8859-1. Thanks

encoding of file shell script

4 Answers

I'd just use

file -bi myfile.txt

to determine the character encoding of a particular file.

This solution has an external dependency, but I suspect file is available on practically every semi-modern distro.

EDIT:

In response to Laurence Gonsalves' comment: -b is the 'brief' option (it omits the filename from the output) and -i is the shorthand equivalent of --mime. Since the short option is not the same on every platform, the most portable form (including Mac OS X) is probably:

file --mime myfile.txt
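
If the result has to be consumed by a script, the charset can be pulled out of the MIME line. The following is a minimal Python sketch, assuming the output looks like "text/plain; charset=utf-8"; parse_charset and file_charset are hypothetical helper names, not part of file itself.

```python
import subprocess


def parse_charset(mime_line):
    """Return the charset from a line like 'text/plain; charset=utf-8', or None."""
    for part in mime_line.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key == "charset":
            return value.strip()
    return None


def file_charset(path):
    """Run 'file -b --mime' on path and return the reported charset.

    Assumes the file command is installed and on PATH.
    """
    result = subprocess.run(
        ["file", "-b", "--mime", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_charset(result.stdout)
```

From a shell script this could be called through python3, or the same parsing could be done with standard text tools on the output of file.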

133

answered Oct 06 '22 07:10

ChristopheD

There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).

Most tools that attempt to make this distinction first try to decode the file as UTF-8 (as that is the stricter encoding) and, if that fails, fall back to ISO-8859-1. You can do this "by hand" with iconv, or you can use file:

$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text

Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.

$ file ascii.txt
ascii.txt: ASCII text
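
The decode-then-fall-back heuristic described above can be sketched in a few lines of Python. This is an illustration under the assumption that only ASCII, UTF-8 and ISO-8859-1 are candidates; guess_encoding is a hypothetical name, and the last branch is a fallback rather than a proof, because every byte sequence decodes as ISO-8859-1.

```python
def guess_encoding(data):
    """Guess 'us-ascii', 'utf-8' or 'iso-8859-1' for a bytes object."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        # UTF-8 is strict: arbitrary high bytes rarely form valid sequences.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any byte sequence is valid ISO-8859-1, so this is only a fallback.
        return "iso-8859-1"
```

For a file, the input would be the full content read in binary mode, for example open(path, "rb").read().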

Finally: there's no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you're going to assume it's natural language and use statistical methods. This is probably why file says "ISO-8859".

answered Oct 06 '22 06:10

Laurence Gonsalves

you can use the file command file --mime myfile.text

answered Oct 06 '22 07:10

jochil

The file command is not 100% reliable. A simple test:

#!/bin/bash

echo "a" > /tmp/foo

for i in {1..1000000}
do
  echo "asdas" >> /tmp/foo
done

echo "üöäÄÜÖß " >> /tmp/foo

file -b --mime-encoding /tmp/foo

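This outputs:

us-ascii

ASCII has no German umlauts, so this result is wrong. The likely cause is that file inspects only the beginning of a file, and here the umlauts sit on the last line, after roughly 6 MB of plain ASCII.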

A file is just a sequence of bytes. Without trusting metadata (a BOM, which is only recommended for UTF-16 and UTF-32, a MIME type, or a header in the data) you can't really detect the encoding. The same sequence of bytes can be interpreted as UTF-8, ISO-8859-1, ISO-8859-2 or anything else you want, provided that encoding defines a mapping for every byte sequence in the file. What you can do is decode the whole file content with the desired character encoding. If that fails, the desired encoding has no mapping for some sequence of bytes in the file.
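
To make the ambiguity concrete, here is a small Python sketch that decodes the same bytes under several encodings. The function name interpretations and the default list of encodings are assumptions chosen for illustration.

```python
def interpretations(data, encodings=("utf-8", "iso-8859-1", "iso-8859-2")):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    result = {}
    for name in encodings:
        try:
            result[name] = data.decode(name)
        except UnicodeDecodeError:
            # This encoding has no mapping for some byte sequence in data.
            result[name] = None
    return result
```

The two bytes 0xC3 0xBC decode without error in all three encodings but produce different text in each, while the single byte 0xFC is rejected by UTF-8 and accepted by both ISO-8859 variants. A successful decode therefore shows that an encoding is possible, not that it is the one the author used.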

From a shell script you could call python, perl or, as Laurence Gonsalves says, iconv. For text files I use this in Python:

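import codecs

# path is the file to check; with errors='strict', reading from f raises
# UnicodeDecodeError as soon as a byte sequence is not valid UTF-8.
f = codecs.open(path, encoding='utf-8', errors='strict')


def valid_string(data):
  """Return True if data (a bytes object) decodes as UTF-8."""
  try:
    data.decode('utf-8')
    return True
  except UnicodeDecodeError:
    return False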

How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. You can add a little trust and check whether a BOM exists (which indicates that the file uses one of the UTF encodings).
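
Both ideas, checking for a BOM and validating the whole content piece by piece, can be sketched with Python's codecs module. The names bom_encoding, stream_is_valid and read_chunks are hypothetical, and the chunk size is an arbitrary assumption. The sketch reads fixed-size chunks rather than lines, which serves the same purpose without loading the file into memory.

```python
import codecs

# UTF-32 BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def bom_encoding(head):
    """Return the encoding implied by a BOM at the start of head, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def stream_is_valid(chunks, encoding="utf-8"):
    """Return True if the concatenated byte chunks decode strictly."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        for chunk in chunks:
            decoder.decode(chunk)
        # Flush: fails if the data ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def read_chunks(path, size=65536):
    """Yield the content of the file at path in binary chunks of size bytes."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(size)
            if not chunk:
                break
            yield chunk
```

A call such as stream_is_valid(read_chunks("/tmp/foo")) examines every byte, so non-ASCII characters at the very end of a large file are not missed. The incremental decoder also handles multi-byte characters that are split across two chunks.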

answered Oct 06 '22 07:10

broadband

Tags:

linux

bash

shell

encoding

rizidoro
