Detect the encoding of a text file using C#

Tags: c#, encoding, utf-8

I have a set of Markdown files to be passed to a Jekyll project, and I need to find the encoding of each one (UTF-8 with BOM, UTF-8 without BOM, or ANSI) using a program or an API.

If I pass the location of the files, the files have to be listed and read, and the encoding of each should be produced as the result.

Is there any code or API for this?

I have already tried sr.CurrentEncoding on a StreamReader, as mentioned in "Effective way to find any file's Encoding", but the result differs from what Notepad++ reports.

I also tried https://github.com/errepi/ude (Mozilla Universal Charset Detector), as suggested in https://social.msdn.microsoft.com/Forums/vstudio/en-US/862e3342-cc88-478f-bca2-e2de6f60d2fb/detect-encoding-of-the-file?forum=csharpgeneral, by referencing ude.dll in the C# project, but that result does not match Notepad++ either: Notepad++ shows the file encoding as UTF-8, while the program reports UTF-8 with BOM.

I should get the same result both ways, so where does the problem occur?

Deepak Raj asked Jan 22 '18 11:01


2 Answers

Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as byte array, just use the GetPreamble() function of the encoding objects. This should allow you to detect a whole range of encodings by preamble.
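
As a rough illustration of the preamble idea, the sketch below compares the start of a buffer against the preambles of a few BOM-carrying encodings. The class and method names (BomSniffer, DetectByPreamble) are made up for this example and are not part of .NET; the candidate list is an assumption and can be extended.

```csharp
using System;
using System.Linq;
using System.Text;

static class BomSniffer
{
    // Longer preambles come first: the UTF-32 LE BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE), so order matters.
    static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(true, true),    // UTF-32 big-endian, with BOM
        new UTF32Encoding(false, true),   // UTF-32 little-endian, with BOM
        new UTF8Encoding(true),           // UTF-8, with BOM
        new UnicodeEncoding(true, true),  // UTF-16 big-endian, with BOM
        new UnicodeEncoding(false, true), // UTF-16 little-endian, with BOM
    };

    // Returns the first candidate whose preamble prefixes the data,
    // or null when no known BOM is present.
    public static Encoding DetectByPreamble(byte[] data)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length > 0
                && data.Length >= preamble.Length
                && data.Take(preamble.Length).SequenceEqual(preamble))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

A null result only means "no BOM found"; it says nothing yet about whether the content is BOM-less UTF-8 or a single-byte encoding.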

Now, as for detecting UTF-8 without preamble, actually that's not very hard either. See, UTF8 has strict bitwise rules about what values are expected in a valid sequence, and you can initialize a UTF8Encoding object in a way that will fail by throwing an exception when these sequences are incorrect.

So if you first do the BOM check, and then the strict decoding check, and finally fall back to Win-1252 encoding (what you call "ANSI") then your detection is done.

// Requires: using System; using System.IO; using System.Linq; using System.Text;
Byte[] bytes = File.ReadAllBytes(filename);
Encoding encoding = null;
String text = null;
// Test UTF8 with BOM. This check can easily be copied and adapted
// to detect many other encodings that use BOMs.
UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
Boolean couldBeUtf8 = true;
Byte[] preamble = encUtf8Bom.GetPreamble();
Int32 prLen = preamble.Length;
if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
{
    // UTF8 BOM found; use encUtf8Bom to decode.
    try
    {
        // Seems that despite being an encoding with preamble,
        // it doesn't actually skip said preamble when decoding...
        text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
        encoding = encUtf8Bom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
        couldBeUtf8 = false;
    }
}
// use boolean to skip this if it's already confirmed as incorrect UTF-8 decoding.
if (couldBeUtf8 && encoding == null)
{
    // test UTF-8 on strict encoding rules. Note that on pure ASCII this will
    // succeed as well, since valid ASCII is automatically valid UTF-8.
    UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
    try
    {
        text = encUtf8NoBom.GetString(bytes);
        encoding = encUtf8NoBom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
    }
}
// fall back to default ANSI encoding.
if (encoding == null)
{
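    // On .NET Core / .NET 5+, code page 1252 is only available after
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.
    encoding = Encoding.GetEncoding(1252);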
    text = encoding.GetString(bytes);
}
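
To tie this back to the question (pass a folder, list the files, print each encoding), the same checks could be wrapped roughly as follows. This is only a sketch: the names MarkdownEncodingReport, Classify and PrintReport, the "*.md" pattern, and the label strings are assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class MarkdownEncodingReport
{
    static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Labels a buffer using the order described above:
    // BOM check, strict UTF-8 decoding, then the ANSI fallback.
    public static string Classify(byte[] bytes)
    {
        bool hasBom = bytes.Length >= Utf8Bom.Length
            && bytes.Take(Utf8Bom.Length).SequenceEqual(Utf8Bom);
        try
        {
            // Second argument (throwOnInvalidBytes) = true: malformed
            // UTF-8 raises an ArgumentException instead of being replaced.
            new UTF8Encoding(false, true).GetString(bytes);
            return hasBom ? "UTF-8 with BOM" : "UTF-8 without BOM";
        }
        catch (ArgumentException)
        {
            // Not valid UTF-8; assumed to be Windows-1252 ("ANSI").
            return "ANSI (assumed Windows-1252)";
        }
    }

    // Lists every *.md file below the directory and prints its label.
    public static void PrintReport(string directory)
    {
        foreach (string path in Directory.EnumerateFiles(
                     directory, "*.md", SearchOption.AllDirectories))
        {
            Console.WriteLine(path + ": " + Classify(File.ReadAllBytes(path)));
        }
    }
}
```

Pure ASCII files come out as "UTF-8 without BOM" here, for the reason given in the code comments above: valid ASCII is also valid UTF-8.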

Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding, meaning everything in it produces a technically valid character, so unless you go for heuristic methods, no further detection can be done on it to distinguish it from other one-byte-per-character encodings.
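
A small sketch of why that is: the same byte decodes without error under two different single-byte code pages, so a successful decode proves nothing. The helper name SingleByteAmbiguity.Decode is made up for this example, and code page 1251 (Cyrillic) is just one arbitrary alternative to 1252. The provider registration is needed on .NET Core / .NET 5+; on .NET Framework, CodePagesEncodingProvider comes from the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

static class SingleByteAmbiguity
{
    // Decodes the bytes with the given legacy code page.
    public static string Decode(int codePage, byte[] bytes)
    {
        // Makes legacy code pages such as 1251 and 1252 available;
        // registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(bytes);
    }
}
```

For the single byte 0xE9, Decode(1252, ...) gives "é" and Decode(1251, ...) gives "й". Neither call fails, so telling the two apart requires a heuristic such as letter-frequency analysis.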

Nyerguds answered Oct 19 '22 22:10


Necromancing.

  • First, you check the Byte-Order Mark.
  • If that doesn't work, you can try to infer the encoding from the text content with the Mozilla Universal Charset Detector C# port.
  • If that doesn't work, you just return the CurrentCulture/InstalledUiCulture/System-Encoding - or whatever.
  • If the system encoding doesn't work, we can either return ASCII or UTF8. Since entries 0-127 of UTF8 are identical to ASCII, we simply return UTF8.

Example (DetectOrGuessEncoding):

namespace SQLMerge
{


    class EncodingDetector
    {


        public static System.Text.Encoding BomInfo(string srcFile)
        {
            return BomInfo(srcFile, false);
        } // End Function BomInfo 



        public static System.Text.Encoding BomInfo(string srcFile, bool thorough)
        {
            byte[] b = new byte[5];

            using (System.IO.FileStream file = new System.IO.FileStream(srcFile, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
            {
                int numRead = file.Read(b, 0, 5);
                if (numRead < 5)
                    System.Array.Resize(ref b, numRead);

                file.Close();
            } // End Using file 

            if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) // UTF32-BE 
                return System.Text.Encoding.GetEncoding("utf-32BE"); // UTF-32, big-endian 
            else if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) // UTF32-LE
                return System.Text.Encoding.UTF32; // UTF-32, little-endian
            // https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-14    
            else if (b.Length >= 4 && b[0] == 0x2b && b[1] == 0x2f && b[2] == 0x76 && (b[3] == 0x38 || b[3] == 0x39 || b[3] == 0x2B || b[3] == 0x2F)) // UTF7
                return System.Text.Encoding.UTF7;  // UTF-7
            else if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) // UTF-8
                return System.Text.Encoding.UTF8;  // UTF-8
            else if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) // UTF16-BE
                return System.Text.Encoding.BigEndianUnicode; // UTF-16, big-endian
            else if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) // UTF16-LE
                return System.Text.Encoding.Unicode; // UTF-16, little-endian

            // Maybe there is a future encoding ...
            // PS: The above yields more than this - this doesn't find UTF7 ...
            if (thorough)
            {
                System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>> lsPreambles = 
                    new System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>>();

                foreach (System.Text.EncodingInfo ei in System.Text.Encoding.GetEncodings())
                {
                    System.Text.Encoding enc = ei.GetEncoding();

                    byte[] preamble = enc.GetPreamble();

                    if (preamble == null)
                        continue;

                    if (preamble.Length == 0)
                        continue;

                    if (preamble.Length > b.Length)
                        continue;

                    System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp =
                        new System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>(enc, preamble);

                    lsPreambles.Add(kvp);
                } // Next ei

                // li.Sort((a, b) => a.CompareTo(b)); // ascending sort
                // li.Sort((a, b) => b.CompareTo(a)); // descending sort
                lsPreambles.Sort(
                    delegate (
                        System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp1, 
                        System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp2)
                    {
                        return kvp2.Value.Length.CompareTo(kvp1.Value.Length);
                    }
                );


                for (int j = 0; j < lsPreambles.Count; ++j)
                {
                    for (int i = 0; i < lsPreambles[j].Value.Length; ++i)
                    {
                        if (b[i] != lsPreambles[j].Value[i])
                        {
                            goto NEXT_J_AND_NOT_NEXT_I;
                        }
                    } // Next i 

                    return lsPreambles[j].Key;
                    NEXT_J_AND_NOT_NEXT_I: continue;
                } // Next j 

            } // End if (thorough)

            return null;
        } // End Function BomInfo 


        public static System.Text.Encoding DetectOrGuessEncoding(string fileName)
        {
            return DetectOrGuessEncoding(fileName, false);
        }


        public static System.Text.Encoding DetectOrGuessEncoding(string fileName, bool withOutput)
        {
            if (!System.IO.File.Exists(fileName))
                return null;


            System.ConsoleColor origBack = System.ConsoleColor.Black;
            System.ConsoleColor origFore = System.ConsoleColor.White;
            

            if (withOutput)
            {
                origBack = System.Console.BackgroundColor;
                origFore = System.Console.ForegroundColor;
            }
            
            // System.Text.Encoding systemEncoding = System.Text.Encoding.Default; // Returns hard-coded UTF8 on .NET Core ... 
            System.Text.Encoding systemEncoding = GetSystemEncoding();
            System.Text.Encoding enc = BomInfo(fileName);
            if (enc != null)
            {
                if (withOutput)
                {
                    System.Console.BackgroundColor = System.ConsoleColor.Green;
                    System.Console.ForegroundColor = System.ConsoleColor.White;
                    System.Console.WriteLine(fileName);
                    System.Console.WriteLine(enc);
                    System.Console.BackgroundColor = origBack;
                    System.Console.ForegroundColor = origFore;
                }

                return enc;
            }

            using (System.IO.Stream strm = System.IO.File.OpenRead(fileName))
            {
                UtfUnknown.DetectionResult detect = UtfUnknown.CharsetDetector.DetectFromStream(strm);

                if (detect != null && detect.Details != null && detect.Details.Count > 0 && detect.Details[0].Confidence < 1)
                {
                    if (withOutput)
                    {
                        System.Console.BackgroundColor = System.ConsoleColor.Red;
                        System.Console.ForegroundColor = System.ConsoleColor.White;
                        System.Console.WriteLine(fileName);
                        System.Console.WriteLine(detect);
                        System.Console.BackgroundColor = origBack;
                        System.Console.ForegroundColor = origFore;
                    }

                    foreach (UtfUnknown.DetectionDetail detail in detect.Details)
                    {
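                        // Compare by code page: Encoding does not overload ==,
                        // so == would only be a reference comparison.
                        if (detail.Encoding != null
                            && (detail.Encoding.CodePage == systemEncoding.CodePage
                                || detail.Encoding.CodePage == System.Text.Encoding.UTF8.CodePage)
                        )
                            return detail.Encoding;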
                    }

                    return detect.Details[0].Encoding;
                }
                else if (detect != null && detect.Details != null && detect.Details.Count > 0)
                {
                    if (withOutput)
                    {
                        System.Console.BackgroundColor = System.ConsoleColor.Green;
                        System.Console.ForegroundColor = System.ConsoleColor.White;
                        System.Console.WriteLine(fileName);
                        System.Console.WriteLine(detect);
                        System.Console.BackgroundColor = origBack;
                        System.Console.ForegroundColor = origFore;
                    }

                    return detect.Details[0].Encoding;
                }

                enc = GetSystemEncoding();

                if (withOutput)
                {
                    System.Console.BackgroundColor = System.ConsoleColor.DarkRed;
                    System.Console.ForegroundColor = System.ConsoleColor.Yellow;
                    System.Console.WriteLine(fileName);
                    System.Console.Write("Assuming ");
                    System.Console.Write(enc.WebName);
                    System.Console.WriteLine("...");
                    System.Console.BackgroundColor = origBack;
                    System.Console.ForegroundColor = origFore;
                }

                return systemEncoding;
            } // End Using strm 

        } // End Function DetectOrGuessEncoding 


        public static System.Text.Encoding GetSystemEncoding()
        {
            // The OEM code page for use by legacy console applications
            // int oem = System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage;

            // The ANSI code page for use by legacy GUI applications
            // int ansi = System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage; // Machine 
            int ansi = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage; // User 

            try
            {
                // https://stackoverflow.com/questions/38476796/how-to-set-net-core-in-if-statement-for-compilation
#if ( NETSTANDARD && !NETSTANDARD1_0 )  || NETCORE || NETCOREAPP3_0 || NETCOREAPP3_1 
                System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
#endif

                System.Text.Encoding enc = System.Text.Encoding.GetEncoding(ansi);
                return enc;
            }
            catch (System.Exception)
            { }


            try
            {

                foreach (System.Text.EncodingInfo ei in System.Text.Encoding.GetEncodings())
                {
                    System.Text.Encoding e = ei.GetEncoding();

                    // 20'127: US-ASCII 
                    if (e.WindowsCodePage == ansi && e.CodePage != 20127)
                    {
                        return e;
                    }

                }
            }
            catch (System.Exception)
            { }

            // return System.Text.Encoding.GetEncoding("iso-8859-1");
            return System.Text.Encoding.UTF8;
        } // End Function GetSystemEncoding 


    } // End Class 


}
Stefan Steiger answered Oct 19 '22 22:10