I have a PHP file that I created with Vim, but I'm not sure what its encoding is.
When I check the encoding in the terminal with the command file -bi foo
(my operating system is Ubuntu 11.04), it gives me the following result:
text/html; charset=us-ascii
But when I open the file with gedit, it says its encoding is UTF-8.
Which one is correct? I want the file to be encoded in UTF-8.
My guess is that there's no BOM in the file, and that the command file -bi
reads the file, doesn't find any bytes outside the ASCII range, and therefore assumes it's ASCII, when in reality it's encoded in UTF-8.
There are a few options you can use: check the Content-Type to see if it includes a charset parameter, which would indicate the encoding (e.g. Content-Type: text/plain; charset=utf-16); or check whether the uploaded data has a BOM (the first few bytes in the file, which would map to the Unicode character U+FEFF - 2 bytes for ...
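For the BOM check specifically, a quick sketch from the terminal (foo.php here is just a placeholder for your file) is to dump the first few bytes and compare them against the well-known BOM signatures:

# Show the first 4 bytes of the file as hex.
head -c 4 foo.php | od -A n -t x1
# ef bb bf  -> UTF-8 BOM
# ff fe     -> UTF-16 little-endian BOM
# fe ff     -> UTF-16 big-endian BOM
# Having no BOM at all is perfectly common for UTF-8, so its absence proves nothing.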
If you find a byte with its high-order bit set, where the bytes both immediately before and immediately after it don't have their high-order bit set, you know the file can't be valid UTF-8 and is most likely ISO-8859 encoded (in UTF-8, bytes above 127 always occur in runs of two or more).
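As a command-line sketch of that heuristic (this assumes GNU grep built with PCRE support, and again uses foo.php as a stand-in file name):

# Look for a byte >= 0x80 whose neighbors are both plain ASCII; any match
# means the file cannot be valid UTF-8 and is probably ISO-8859-something.
LC_ALL=C grep -qP '(?<![\x80-\xFF])[\x80-\xFF](?![\x80-\xFF])' foo.php \
  && echo "isolated high byte found: not UTF-8, likely ISO-8859"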
Just scan the file: if you find any NUL byte ("\0"), it is most likely UTF-16. Source code is bound to contain ASCII characters, and in UTF-16 each ASCII character is encoded as a two-byte pair, one byte of which is zero.
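A sketch of that scan using tr, in the same style as the ASCII test later in this answer (foo.php is a placeholder):

# Count the NUL bytes in the file; anything greater than 0 is a strong
# hint of UTF-16, since ordinary UTF-8 or ASCII text contains no NUL bytes.
tr -cd '\000' < foo.php | wc -c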
$ file --mime my.txt
my.txt: text/plain; charset=iso-8859-1
Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.
That being said, the file command typically only examines a short segment at the beginning of the file to determine its type, so it might declare the file us-ascii even though it contains non-ASCII characters, if those happen to lie beyond the segment it inspects. On the other hand, gedit might say the file is UTF-8 even if it contains only ASCII, because UTF-8 is gedit's preferred character encoding and it intends to save the file as UTF-8 if you add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
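You can see the ASCII-subset point in action with a small experiment (a sketch; the exact charset labels depend on your version of file):

# Pure ASCII content: file reports us-ascii, which is also valid UTF-8.
printf 'hello\n' > demo.txt
file -bi demo.txt                  # text/plain; charset=us-ascii (typically)
# Append one non-ASCII character ("é" as the UTF-8 byte pair C3 A9).
printf '\303\251\n' >> demo.txt
file -bi demo.txt                  # text/plain; charset=utf-8 (typically)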
Now to your question:
First, run this command:
tr -d \\000-\\177 < your-file | wc -c
If the output is "0", then the file contains only ASCII characters. It's in ASCII (and also valid UTF-8); end of story.
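The same test wrapped in a conditional, if you want to script it (assuming bash, with foo.php standing in for your file):

# Count the bytes that survive after deleting the entire ASCII range;
# zero survivors means the file is pure ASCII.
if [ "$(tr -d '\000-\177' < foo.php | wc -c)" -eq 0 ]; then
  echo "pure ASCII (and therefore also valid UTF-8)"
fi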
Otherwise, run this command:
iconv -f utf-8 -t ucs-4 < your-file >/dev/null
If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).
If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.
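In scriptable form, keyed off iconv's exit status (foo.php is again a placeholder):

# iconv exits non-zero as soon as it hits an invalid UTF-8 sequence.
if iconv -f utf-8 -t ucs-4 < foo.php > /dev/null 2>&1; then
  echo "valid UTF-8"
else
  echo "not valid UTF-8 (or partially corrupted)"
fi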
On Linux, you can also try chardet:
$ chardet <filename>
It also reports a confidence level in the range [0, 1] alongside its guess.