I have some text files encoded with different character encodings, such as ascii, utf-8, big5, and gb2312.
Now I want to determine their exact character encodings so I can view them in a text editor; otherwise they display garbled characters.
I searched online and found that the file command can display the character encoding of a file, like:
$ file -bi *
text/plain; charset=iso-8859-1
text/plain; charset=us-ascii
text/plain; charset=iso-8859-1
text/plain; charset=utf-8
Unfortunately, files encoded with big5 and gb2312 are both reported as charset=iso-8859-1, so I still can't tell them apart.
Is there a better way to check the character encoding of a text file?
To verify whether a file is valid in an encoding such as ascii, iso-8859-1, utf-8 or whatever, a good solution is to use the iconv command.
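For example, a minimal sketch of that check, assuming GNU iconv and using big5.txt from the outputs below as a stand-in for one of the files in question:
$ iconv -f utf-8 -t utf-8 big5.txt > /dev/null && echo "valid utf-8" || echo "not valid utf-8"
A non-zero exit status means the file contains byte sequences that are not valid in the tested encoding, so you can try candidate encodings one by one.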
Open up your file using the plain old Notepad that comes with Windows. It will show you the encoding of the file when you click "Save As...": whatever encoding is selected by default is the file's current encoding.
There are a few options you can use: check the Content-Type to see if it includes a charset parameter that would indicate the encoding (e.g. Content-Type: text/plain; charset=utf-16); check if the data has a BOM (the first few bytes of the file, which map to the Unicode character U+FEFF - 2 bytes for ...
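As a rough illustration of the BOM check on the command line (a sketch; file.txt is just a placeholder name), dump the first few bytes and compare them with the known BOMs:
$ head -c 4 file.txt | xxd
If the output starts with ef bb bf the file has a UTF-8 BOM; ff fe or fe ff at the start indicate UTF-16 (little- or big-endian respectively).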
One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and information in the document. If the validator fails to detect the encoding, it can be selected on the validator result page via the 'Encoding' pulldown menu (example).
To some extent, @ewcz's advice works.
$ uchardet *
big5.txt: BIG5
conf: ASCII
gb2312-windows.txt: GB18030
gb.txt: GB18030
test.java: UTF-8
And
$ enca -L chinese *
big5.txt: Traditional Chinese Industrial Standard; Big5
conf: 7bit ASCII characters
gb2312-windows.txt: Simplified Chinese National Standard; GB2312
  CRLF line terminators
gb.txt: Simplified Chinese National Standard; GB2312
test.java: Universal transformation format 8 bits; UTF-8
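Both tools need to be installed first; on Debian/Ubuntu the packages appear to be named after the tools themselves:
$ sudo apt install uchardet enca
Once the encoding is detected, the file can be converted to UTF-8 so any editor displays it correctly, for example (a sketch reusing gb.txt and the GB18030 result from above):
$ iconv -f GB18030 -t UTF-8 gb.txt > gb.utf8.txt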
You can use a command line tool like detect-file-encoding-and-language:
$ npm install -g detect-file-encoding-and-language
Then you can detect the encoding like so:
$ dfeal "/home/user name/Documents/subtitle file.srt"
# Possible result: { language: french, encoding: CP1252, confidence: { language: 0.99, encoding: 1 } }
Make sure you have Node.js and NPM installed! If you don't have them installed already:
$ sudo apt install nodejs npm