
decode a file stream using UTF-8

I have a very large XML document (about 120 MB), and I do not want to load it into memory all at once. My goal is to check whether the file is valid UTF-8.

Is there a quick way to check this without reading the whole file into memory as a byte[]?

I am using VSTS 2008 and C#.

When I load the XML document with XmlDocument and it contains invalid byte sequences, an exception is thrown. But when I read the whole content into a byte array and then decode it as UTF-8, no exception is thrown. Any ideas why?

Here is a screenshot showing the content of my XML file, or you can download a copy of the file from here



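using System;
using System.IO;
using System.Text;
using System.Xml;
class Program
{
    public static byte[] RawReadingTest(string fileName)
    {
        byte[] buff = null;
        try
        {
            FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
            BinaryReader br = new BinaryReader(fs);
            long numBytes = new FileInfo(fileName).Length;
            buff = br.ReadBytes((int)numBytes); // raw bytes only, nothing is decoded here
        }
        catch (Exception ex) { Console.WriteLine(ex.Message); }
        return buff;
    }

    static void XMLTest()
    {
        try
        {
            XmlDocument xDoc = new XmlDocument();
            xDoc.Load("c:\\abc.xml"); // this is where the exception is thrown
        }
        catch (Exception ex) { Console.WriteLine(ex.Message); }
    }

    static void Main()
    {
        try
        {
            XMLTest();
            Encoding ae = Encoding.GetEncoding("utf-8");
            string filename = "c:\\abc.xml";
            ae.GetString(RawReadingTest(filename)); // no exception is thrown here
        }
        catch (Exception ex) { Console.WriteLine(ex.Message); }
    }
}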

EDIT 2: With new UTF8Encoding(true, true) an exception is thrown, but with new UTF8Encoding(false, true) no exception is thrown. I am confused: the second parameter should be the one that controls whether an exception is thrown on invalid byte sequences, so why does the first parameter matter?

    public static void TestTextReader2()
    {
        try
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader(
                "c:\\abc.xml",
                new UTF8Encoding(true, true)))
            {
                int bufferSize = 10 * 1024 * 1024; //could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                int actualsize = sr.Read(buffer, 0, bufferSize);
                while (actualsize > 0)
                {
                    actualsize = sr.Read(buffer, 0, bufferSize);
                }
            }
        }
        catch (Exception e)
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
George2 asked May 18 '09 05:05


2 Answers

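// Fragment of a method body: pathToFile, GoodUTF8File and BadUTF8File are supplied by the caller.
var buffer = new char[32768] ;

try
{
    using (var stream = new StreamReader (pathToFile, 
        new UTF8Encoding (true, true)))
    {
        while (true)
            if (stream.Read (buffer, 0, buffer.Length) == 0)
                return GoodUTF8File ;
    }
}
catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
{
    return BadUTF8File ;
}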
Anton Tykhyy answered Sep 21 '22 23:09


@George2 I think they mean a solution like the following (which I haven't tested).

Handling the transition between buffers (i.e. caching extra bytes or partial characters between reads) is the responsibility of StreamReader and an internal detail of its implementation.
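
The point about partial characters can be illustrated with the Decoder class from System.Text. The sketch below is not StreamReader's actual implementation; it only shows, with made-up sample bytes and a hypothetical class name, that a stateful UTF-8 decoder carries an incomplete multi-byte sequence over from one chunk to the next.

```csharp
using System;
using System.Text;

public class PartialCharDemo
{
    // Decodes two chunks whose boundary falls in the middle of a two-byte character.
    public static string DecodeInChunks()
    {
        // U+00E9 is encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] first = { 0x61, 0xC3 };  // 'a' plus the first half of U+00E9
        byte[] second = { 0xA9, 0x62 }; // second half of U+00E9 plus 'b'

        // The decoder keeps the dangling 0xC3 internally between the two calls.
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();
        char[] chars = new char[4];

        int n1 = decoder.GetChars(first, 0, first.Length, chars, 0);
        int n2 = decoder.GetChars(second, 0, second.Length, chars, n1);

        return new string(chars, 0, n1 + n2);
    }

    public static void Main()
    {
        // Compare against the expected string instead of printing it,
        // so the result does not depend on the console's output encoding.
        Console.WriteLine(DecodeInChunks() == "a\u00E9b");
    }
}
```

Because the decoder keeps that state, the caller can pick any buffer size without worrying about where a multi-byte character happens to be split.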

using System;
using System.IO;
using System.Text;

class Test 
{
    public static void Main() 
    {
        try 
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            // The path is the one from the question; the strict encoding throws on invalid bytes.
            using (StreamReader sr = new StreamReader("c:\\abc.xml", new UTF8Encoding(true, true))) 
            {
                const int bufferSize = 1000; //could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                while (sr.Read(buffer, 0, bufferSize) > 0) 
                {
                    //successfully decoded another buffer's-worth of data
                }
            }
        }
        catch (Exception e) 
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
}
ChrisW answered Sep 23 '22 23:09
