How to check character encoding of a file in Linux

I have some text files in different character encodings, such as ASCII, UTF-8, Big5, and GB2312.

I want to know their exact encodings so that I can open them correctly in a text editor; otherwise they show up as garbled characters.

I searched online and found that the file command can display the character encoding of a file, like this:

$ file -bi *
text/plain; charset=iso-8859-1
text/plain; charset=us-ascii
text/plain; charset=iso-8859-1
text/plain; charset=utf-8

Unfortunately, the files encoded in Big5 and GB2312 are both reported as charset=iso-8859-1, so I still can't tell them apart. Is there a better way to check the character encoding of a text file?

Young asked Feb 11 '18 07:02



2 Answers

@ewcz's advice works to some extent. With uchardet:

$ uchardet *
big5.txt: BIG5
conf: ASCII
gb2312-windows.txt: GB18030
gb.txt: GB18030
test.java: UTF-8

And with enca:

$ enca -L chinese *
big5.txt: Traditional Chinese Industrial Standard; Big5
conf: 7bit ASCII characters
gb2312-windows.txt: Simplified Chinese National Standard; GB2312
  CRLF line terminators
gb.txt: Simplified Chinese National Standard; GB2312
test.java: Universal transformation format 8 bits; UTF-8
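
Once a detector has named an encoding, iconv can do the actual conversion so that the file opens cleanly in a UTF-8 editor. The following is a minimal sketch and not part of the original answer: to_utf8 and decodes_as are hypothetical helper names, and it assumes an iconv build that knows the encoding names passed in (glibc's iconv accepts BIG5, GB2312 and GB18030).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from any tool.

# to_utf8 FILE ENCODING
# Decode FILE from ENCODING and write it to stdout as UTF-8.
# iconv exits non-zero if FILE contains bytes that are invalid in ENCODING.
to_utf8() {
    iconv -f "$2" -t UTF-8 "$1"
}

# decodes_as FILE ENCODING
# Succeed (exit status 0) only if every byte sequence in FILE is valid
# in ENCODING. The converted output is discarded.
decodes_as() {
    iconv -f "$2" -t UTF-8 "$1" >/dev/null 2>&1
}
```

For example, to_utf8 big5.txt BIG5 > big5.utf8.txt writes a UTF-8 copy of the file. A successful decodes_as only shows that the bytes are valid in that encoding, not that the guess is right, because many byte sequences are valid in more than one legacy encoding. It is a sanity check on the detector's answer rather than a replacement for uchardet or enca.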

Young answered Nov 07 '22 15:11


You can use a command-line tool like detect-file-encoding-and-language:

$ npm install -g detect-file-encoding-and-language

Then you can detect the encoding like so:

$ dfeal "/home/user name/Documents/subtitle file.srt"
# Possible result: { language: french, encoding: CP1252, confidence: { language: 0.99, encoding: 1 } }

This requires Node.js and npm. If you don't have them installed already:

$ sudo apt install nodejs npm

Falaen answered Nov 07 '22 17:11