I need to determine if a text file's content is equal to one of these text encodings:
System.Text.Encoding.ASCII
System.Text.Encoding.BigEndianUnicode ' UTF-16 (big endian)
System.Text.Encoding.Default ' ANSI
System.Text.Encoding.Unicode ' UTF-16 (little endian)
System.Text.Encoding.UTF32
System.Text.Encoding.UTF7
System.Text.Encoding.UTF8
I don't know how to read the byte order marks of the files. I've seen snippets that do this, but they can only determine whether a file is ASCII or Unicode, so I need something more universal.
The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once it's loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:
Dim data() As Byte = File.ReadAllBytes("test.txt")
Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, detecting the encoding becomes much easier, since each Unicode encoding that defines a BOM uses a different one.
The easiest way to automatically detect the encoding from the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream by accessing its CurrentEncoding property. However, the CurrentEncoding property won't reflect the detected encoding until after the StreamReader has read the BOM, so you first have to read past the BOM before you can get the encoding, for instance:
Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As New StreamReader(filePath, True)
        ' Read one character so the reader consumes the BOM and updates CurrentEncoding
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function
However, the problem with this approach is that the MSDN documentation seems to imply that the StreamReader may only detect certain kinds of encodings:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
Also, if the StreamReader is incapable of determining the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding, without giving you any indication that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:
UTF-8: EF BB BF
UTF-16 (big endian): FE FF
UTF-16 (little endian): FF FE
UTF-32 (big endian): 00 00 FE FF
UTF-32 (little endian): FF FE 00 00
So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:
If (data(0) = &HFF) AndAlso (data(1) = &HFE) Then
    ' Data starts with UTF-16 (little endian) BOM
    ' (the UTF-32 little-endian BOM, FF FE 00 00, begins with the same two bytes)
End If
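Because the UTF-32 (little endian) BOM begins with the same two bytes as the UTF-16 (little endian) BOM, the order of the comparisons matters once you check more than one BOM. Here is a minimal sketch of a manual lookup over the five BOMs listed above that tests the four-byte BOMs first. The module name BomLookup and the function name DescribeBom are hypothetical, chosen only for illustration:

```vbnet
Imports System
Imports System.Linq

Module BomLookup
    ' Hypothetical helper: returns a label for the BOM at the start of data,
    ' or Nothing when none of the five BOMs matches.
    Public Function DescribeBom(data() As Byte) As String
        ' Four-byte BOMs are listed first because FF FE 00 00 (UTF-32 LE)
        ' begins with FF FE (UTF-16 LE).
        Dim knownBoms = {
            Tuple.Create("UTF-32 (big endian)", New Byte() {&H0, &H0, &HFE, &HFF}),
            Tuple.Create("UTF-32 (little endian)", New Byte() {&HFF, &HFE, &H0, &H0}),
            Tuple.Create("UTF-8", New Byte() {&HEF, &HBB, &HBF}),
            Tuple.Create("UTF-16 (big endian)", New Byte() {&HFE, &HFF}),
            Tuple.Create("UTF-16 (little endian)", New Byte() {&HFF, &HFE})
        }
        For Each candidate In knownBoms
            Dim bom() As Byte = candidate.Item2
            ' The length check guards against data shorter than the BOM being tested.
            If data.Length >= bom.Length AndAlso data.Take(bom.Length).SequenceEqual(bom) Then
                Return candidate.Item1
            End If
        Next
        Return Nothing
    End Function
End Module
```

Note that a UTF-16 (little endian) file whose first character after the BOM is U+0000 also starts with FF FE 00 00, so this sketch would report it as UTF-32. That case is rare in text files, but the BOM alone cannot rule it out.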
Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to check if a byte array starts with the BOM for Unicode (UTF-16, little-endian), you could just do this:
Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    If (data(0) = bom(0)) AndAlso (data(1) = bom(1)) Then
        Return True
    Else
        Return False
    End If
End Function
Of course, the above function assumes that the data is at least two bytes long and that the BOM is exactly two bytes. So, while it illustrates the idea as clearly as possible, it's not the safest way to do it. To make it tolerant of different array lengths, especially since the BOM lengths themselves can vary from one encoding to the next, it would be safer to do something like this:
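Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    ' Zip stops at the shorter sequence, so also require data to be at least as long as the BOM
    Return data.Length >= bom.Length AndAlso data.Zip(bom, Function(x, y) x = y).All(Function(x) x)
End Function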
So, the problem then becomes, how do you get a list of all the encodings? Well, it just so happens that the .NET Encoding class also provides a shared (static) method called GetEncodings which returns an EncodingInfo object for each supported encoding. Therefore, you could create a method which loops through all of the encodings, gets the BOM of each one, and compares it to the byte array until you find one that matches. For instance:
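Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    ' Try longer BOMs first: the UTF-32 (little endian) BOM FF FE 00 00 starts with the UTF-16 (little endian) BOM FF FE
    Return Encoding.GetEncodings().
        Select(Function(info) info.GetEncoding()).
        OrderByDescending(Function(enc) enc.GetPreamble().Length).
        FirstOrDefault(Function(enc) DataStartsWithBom(data, enc))
End Function
Private Function DataStartsWithBom(data() As Byte, enc As Encoding) As Boolean
    Dim bom() As Byte = enc.GetPreamble()
    ' Encodings without a BOM never match, and data shorter than the BOM cannot match
    If bom.Length <> 0 AndAlso data.Length >= bom.Length Then
        Return data.
            Zip(bom, Function(x, y) x = y).
            All(Function(x) x)
    Else
        Return False
    End If
End Function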
Once you make a function like that, then you could detect the encoding of a file like this:
Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If
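If the files are large, you may not want to load the whole file just to inspect its BOM. As a rough sketch, you could read only the first four bytes, on the assumption that no preamble you care about is longer than that (which holds for the UTF-8, UTF-16, and UTF-32 BOMs). The module name BomReader and the function name ReadLeadingBytes are made up for this example:

```vbnet
Imports System
Imports System.IO

Module BomReader
    ' Hypothetical helper: returns up to the first four bytes of the file.
    Public Function ReadLeadingBytes(filePath As String) As Byte()
        Dim buffer(3) As Byte ' four elements, indexes 0 to 3
        Dim total As Integer = 0
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            ' Read may return fewer bytes than requested, so loop until the
            ' buffer is full or the end of the file is reached.
            While total < buffer.Length
                Dim count As Integer = fs.Read(buffer, total, buffer.Length - total)
                If count = 0 Then Exit While
                total += count
            End While
        End Using
        ' Trim the array when the file is shorter than four bytes.
        Array.Resize(buffer, total)
        Return buffer
    End Function
End Module
```

The result can then be passed to the function above in place of the full file contents, for example DetectEncodingFromBom(ReadLeadingBytes("test.txt")).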
However, the problem remains: how do you automatically detect the correct encoding when there is no BOM? Technically, it's recommended that you don't place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it's certainly not out of the realm of possibility that a text file may not have a BOM. If all the files that you deal with are in English, it's probably safe to assume that if no BOM is present, then UTF-8 will suffice, since plain ASCII text is byte-for-byte identical in UTF-8. However, if any of the files happen to use something else, without a BOM, then that won't work.
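One cheap check you can make when no BOM is present is to test whether the bytes are at least valid UTF-8. The following is only a sketch of that idea, not a complete detector: it decodes the data with a UTF8Encoding that throws on invalid bytes and falls back to Encoding.Default otherwise. The names NoBomGuess and GuessEncodingWithoutBom are hypothetical, and the fallback assumes the .NET Framework, where Encoding.Default is the system ANSI code page:

```vbnet
Imports System
Imports System.Text

Module NoBomGuess
    ' Hypothetical helper: guesses an encoding for data that has no BOM.
    Public Function GuessEncodingWithoutBom(data() As Byte) As Encoding
        ' throwOnInvalidBytes:=True makes decoding fail on malformed UTF-8
        ' instead of silently substituting replacement characters.
        Dim strictUtf8 As New UTF8Encoding(encoderShouldEmitUTF8Identifier:=False, throwOnInvalidBytes:=True)
        Try
            strictUtf8.GetString(data)
            Return strictUtf8
        Catch ex As ArgumentException
            ' Not valid UTF-8. Falling back to the ANSI code page is an
            ' assumption; the data could be in any other single-byte encoding.
            Return Encoding.Default
        End Try
    End Function
End Module
```

Pure ASCII data always passes the UTF-8 check, and a short ANSI file can occasionally happen to be valid UTF-8 as well, so treat the result as a guess rather than a guarantee.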
As you correctly observed, there are applications that still automatically detect the encoding even when no BOM is present, but they do it through heuristics (i.e. educated guessing) and sometimes they are not accurate. Basically they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). Here is a list of some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful:
Even though this question was for C#, you may also find the answers to it useful.