I have an XML document which is very big (about 120 MB), and I do not want to load it into memory all at once. My purpose is to check whether this file uses valid UTF-8 encoding.
Any ideas for a quick check without reading the whole file into memory as a byte[]?
I am using VSTS 2008 and C#.
When using XmlDocument to load an XML document that contains invalid byte sequences, an exception is thrown, but when reading all the content into a byte array and then decoding it as UTF-8, no exception is thrown. Any ideas why?
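A likely explanation (not stated in the original post): `Encoding.GetEncoding("utf-8")` and `Encoding.UTF8` use a replacement fallback by default, silently substituting U+FFFD for invalid byte sequences instead of throwing. A minimal sketch illustrating the difference, using a hand-made invalid sequence:

```csharp
using System;
using System.Text;

class Utf8CheckDemo
{
    static void Main()
    {
        // 0xC3 followed by 0x28 is an invalid UTF-8 sequence.
        byte[] invalid = { 0x41, 0xC3, 0x28, 0x42 };

        // The default UTF-8 encoding replaces invalid bytes with U+FFFD,
        // so GetString succeeds without any exception.
        string lenient = Encoding.UTF8.GetString(invalid);
        Console.WriteLine(lenient.IndexOf('\uFFFD') >= 0); // True

        // A UTF8Encoding constructed with throwOnInvalidBytes: true
        // throws DecoderFallbackException on the same input.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(invalid);
            Console.WriteLine("valid");
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("invalid"); // prints "invalid"
        }
    }
}
```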
Here is a screenshot showing the content of my XML file, or you can download a copy of the file from here
EDIT 1:
class Program
{
    public static byte[] RawReadingTest(string fileName)
    {
        byte[] buff = null;
        try
        {
            // using statements added so the stream and reader are disposed.
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            using (BinaryReader br = new BinaryReader(fs))
            {
                long numBytes = new FileInfo(fileName).Length;
                buff = br.ReadBytes((int)numBytes);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        return buff;
    }

    static void XMLTest()
    {
        try
        {
            XmlDocument xDoc = new XmlDocument();
            xDoc.Load("c:\\abc.xml");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    static void Main()
    {
        try
        {
            XMLTest();
            Encoding ae = Encoding.GetEncoding("utf-8");
            string filename = "c:\\abc.xml";
            ae.GetString(RawReadingTest(filename));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
EDIT 2: When using new UTF8Encoding(true, true) an exception is thrown, but when using new UTF8Encoding(false, true) no exception is thrown. I am confused: the second parameter is what should control whether an exception is thrown on invalid byte sequences, so why does the first parameter matter?
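For what it's worth, the documented behavior is that the first constructor argument (encoderShouldEmitUTF8Identifier) only controls whether a BOM is emitted when encoding; it has no effect on decoding, so both encodings above should decode identically and the observed difference probably comes from something else (e.g. different test files). A sketch illustrating this, untested against the poster's file:

```csharp
using System;
using System.Text;

class Utf8CtorDemo
{
    static void Main()
    {
        byte[] invalid = { 0xC3, 0x28 }; // invalid UTF-8 sequence

        // The first argument only affects the preamble (BOM) produced
        // when writing; it does not change how bytes are decoded.
        Console.WriteLine(new UTF8Encoding(true, true).GetPreamble().Length);  // 3
        Console.WriteLine(new UTF8Encoding(false, true).GetPreamble().Length); // 0

        // Both encodings throw on the same invalid input, because the
        // second argument (throwOnInvalidBytes) is true in both cases.
        foreach (var enc in new[] { new UTF8Encoding(true, true),
                                    new UTF8Encoding(false, true) })
        {
            try
            {
                enc.GetString(invalid);
                Console.WriteLine("no exception");
            }
            catch (DecoderFallbackException)
            {
                Console.WriteLine("exception"); // printed for both
            }
        }
    }
}
```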
public static void TestTextReader2()
{
    try
    {
        // Create an instance of StreamReader to read from a file.
        // The using statement also closes the StreamReader.
        using (StreamReader sr = new StreamReader(
            "c:\\a.xml",
            new UTF8Encoding(true, true)
            ))
        {
            int bufferSize = 10 * 1024 * 1024; // could be anything
            char[] buffer = new char[bufferSize];
            // Read from the file until the end of the file is reached.
            int actualsize = sr.Read(buffer, 0, bufferSize);
            while (actualsize > 0)
            {
                actualsize = sr.Read(buffer, 0, bufferSize);
            }
        }
    }
    catch (Exception e)
    {
        // Let the user know what went wrong.
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
var buffer = new char[32768];
using (var stream = new StreamReader(pathToFile,
    new UTF8Encoding(true, true)))
{
    while (true)
    {
        try
        {
            if (stream.Read(buffer, 0, buffer.Length) == 0)
                return GoodUTF8File;
        }
        // DecoderFallbackException derives from ArgumentException,
        // so this catches the decoding failure.
        catch (ArgumentException)
        {
            return BadUTF8File;
        }
    }
}
@George2 I think they mean a solution like the following (which I haven't tested).
Handling the transition between buffers (i.e. caching partial characters between reads) is the responsibility of the StreamReader implementation and an internal detail of it.
using System;
using System.IO;
using System.Text;

class Test
{
    public static void Main()
    {
        try
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            // Note: throwOnInvalidBytes must be true; the plain
            // Encoding.UTF8 would silently replace invalid sequences.
            using (StreamReader sr = new StreamReader(
                "TestFile.txt",
                new UTF8Encoding(false, true)
                ))
            {
                const int bufferSize = 1000; // could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                // (The argument order is (buffer, index, count).)
                while (sr.Read(buffer, 0, bufferSize) > 0)
                {
                    // successfully decoded another buffer's worth of data
                }
            }
        }
        catch (Exception e)
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
}
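An alternative sketch (my own, not from the answers above) that validates the raw bytes directly with a strict Decoder instead of going through StreamReader. A Decoder buffers incomplete multi-byte sequences between calls, so characters split across chunk boundaries are handled, and flushing at the end catches a file truncated mid-sequence:

```csharp
using System;
using System.IO;
using System.Text;

class Utf8Validator
{
    // Streams the file through a strict UTF-8 decoder in fixed-size
    // chunks, so memory use stays constant regardless of file size.
    public static bool IsValidUtf8File(string path)
    {
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();
        byte[] bytes = new byte[64 * 1024];
        // GetMaxCharCount for n bytes is n + 1, so size accordingly.
        char[] chars = new char[64 * 1024 + 1];
        try
        {
            using (FileStream fs = File.OpenRead(path))
            {
                int read;
                while ((read = fs.Read(bytes, 0, bytes.Length)) > 0)
                {
                    // flush: false keeps the bytes of an incomplete
                    // trailing sequence buffered for the next chunk.
                    decoder.GetChars(bytes, 0, read, chars, 0, false);
                }
                // flush: true makes a truncated final sequence throw.
                decoder.GetChars(bytes, 0, 0, chars, 0, true);
            }
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
        return true;
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x41, 0xC3, 0xA9 }); // "Aé", valid
        Console.WriteLine(IsValidUtf8File(path)); // True
        File.WriteAllBytes(path, new byte[] { 0xC3, 0x28 }); // invalid
        Console.WriteLine(IsValidUtf8File(path)); // False
        File.Delete(path);
    }
}
```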