I have some text files encoded with different character encodings, such as ascii, utf-8, big5, and gb2312.
Now I want to determine their exact character encodings so I can view them in a text editor; otherwise they display garbled characters.
I searched online and found that the file command can display the character encoding of a file, like:
$ file -bi *
text/plain; charset=iso-8859-1
text/plain; charset=us-ascii
text/plain; charset=iso-8859-1
text/plain; charset=utf-8
Unfortunately, files encoded with big5 and gb2312 are both reported as charset=iso-8859-1, so I still can't tell them apart.
Is there a better way to check the character encoding of a text file?
To verify whether a file is valid in an encoding such as ascii, iso-8859-1, utf-8 or whatever, a good solution is to use the iconv command.
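For example, a minimal sketch of that check, assuming GNU iconv and using big5.txt from the outputs below as a stand-in for one of the files in question:
$ iconv -f utf-8 -t utf-8 big5.txt > /dev/null && echo "valid utf-8" || echo "not valid utf-8"
A non-zero exit status means the file contains byte sequences that are not valid in the tested encoding, so you can try candidate encodings one by one.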
Open up your file using the plain old Notepad that comes with Windows. It will show you the encoding of the file when you click "Save As...": whatever encoding is selected by default is the file's current encoding.
There are a few options you can use: check the Content-Type to see if it includes a charset parameter that would indicate the encoding (e.g. Content-Type: text/plain; charset=utf-16); check if the data has a BOM (the first few bytes of the file, which map to the Unicode character U+FEFF - 2 bytes for ...
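As a rough illustration of the BOM check on the command line (a sketch; file.txt is just a placeholder name), dump the first few bytes and compare them with the known BOMs:
$ head -c 4 file.txt | xxd
If the output starts with ef bb bf the file has a UTF-8 BOM; ff fe or fe ff at the start indicate UTF-16 (little- or big-endian respectively).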
One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and information in the document. If the validator fails to detect the encoding, it can be selected on the validator result page via the 'Encoding' pulldown menu (example).
To some extent, @ewcz's advice works.
$ uchardet *
big5.txt: BIG5
conf: ASCII
gb2312-windows.txt: GB18030
gb.txt: GB18030
test.java: UTF-8
And
$ enca -L chinese *
big5.txt: Traditional Chinese Industrial Standard; Big5
conf: 7bit ASCII characters
gb2312-windows.txt: Simplified Chinese National Standard; GB2312
  CRLF line terminators
gb.txt: Simplified Chinese National Standard; GB2312
test.java: Universal transformation format 8 bits; UTF-8
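Both tools need to be installed first; on Debian/Ubuntu the packages appear to be named after the tools themselves:
$ sudo apt install uchardet enca
Once the encoding is detected, the file can be converted to UTF-8 so any editor displays it correctly, for example (a sketch reusing gb.txt and the GB18030 result from above):
$ iconv -f GB18030 -t UTF-8 gb.txt > gb.utf8.txt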
You can use a command line tool like detect-file-encoding-and-language:
$ npm install -g detect-file-encoding-and-language
Then you can detect the encoding like so:
$ dfeal "/home/user name/Documents/subtitle file.srt"
# Possible result: { language: french, encoding: CP1252, confidence: { language: 0.99, encoding: 1 } }
Make sure you have Node.js and NPM installed! If you don't have them installed already:
$ sudo apt install nodejs npm