File size restriction or limitation in C#

I want to generate an XML file from a single object (which contains a nested collection) holding a large amount of data, but there is a limitation: the XML file can't exceed 50MB.

Is there a good way to do this?

Update: speed is not important; the main requirement is to split the output into files of at most 50MB each.

pang asked Aug 19 '09 02:08


2 Answers

I ran into a similar requirement in my work. My best effort (intuitive, easy to implement, relatively performant) is as follows. I write with an XmlWriter while monitoring the underlying stream. When the stream surpasses my file size limit, I complete the current XML fragment, save the file, and close the stream.

Then, on a second pass, I load the full DOM into memory and iteratively remove nodes and save the document until it is of an acceptable size.

For example:

// arbitrary limit of 10MB
long FileSizeLimit = 10*1024*1024;

// open file stream to monitor file size
using (FileStream file = new FileStream("some.data.xml", FileMode.Create))
using (XmlWriter writer = XmlWriter.Create(file))
{
    writer.WriteStartElement("root");

    // loop until the file reaches FileSizeLimit; XmlWriter buffers its
    // output, so file.Length can lag behind what has been written
    for (; file.Length < FileSizeLimit; )
    {
        // write contents
        writer.WriteElementString(
            "data", 
            string.Format("{0}/{0}/{0}/{0}/{0}", Guid.NewGuid()));
    }

    // complete fragment; this is the trickiest part, 
    // since a complex document may have an arbitrarily
    // long tail, and cannot be known during file size
    // sampling above
    writer.WriteEndElement();
    writer.Flush();
}

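// iteratively reduce document size
// NOTE: XDocument will load full DOM into memory
XDocument document = XDocument.Load("some.data.xml");
XElement root = document.Element("root");
// stop once the file fits, or when there is nothing left to remove
while (new FileInfo("some.data.xml").Length > FileSizeLimit &&
    root.LastNode != null)
{
    root.LastNode.Remove();
    // disable formatting so Save does not add indentation
    document.Save("some.data.xml", SaveOptions.DisableFormatting);
}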

There are ways to improve this. If memory is a constraint, one possibility is to rewrite the iterative step: count the nodes actually written in the first pass, then rewrite the file with one element fewer, and repeat until the full document is of the desired size.

This last recommendation may be the route to go, especially if you already need to track elements written to resume writing in another file.
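As a rough sketch of the "track elements and resume in another file" idea, each element can be measured before it is written, so a file is closed before it would exceed the limit and no trimming pass is needed. This is a different technique from the two-pass code above, not a drop-in replacement. The names `XmlSplitter`, `WriteSplit` and `ToFragment`, the `<root>`/`<data>` element names, and the `prefix.N.xml` file naming are all assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlSplitter
{
    // Serializes one <data> element to a string so that its size
    // is known before it is written to a file. (hypothetical helper)
    static string ToFragment(string value)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            ConformanceLevel = ConformanceLevel.Fragment
        };
        using (XmlWriter w = XmlWriter.Create(sb, settings))
        {
            w.WriteElementString("data", value);
        }
        return sb.ToString();
    }

    // Writes values into prefix.0.xml, prefix.1.xml, ... and starts a new
    // file whenever the next element plus the closing tag would not fit.
    // A single element larger than the limit is still written on its own.
    public static List<string> WriteSplit(
        IEnumerable<string> values, string prefix, long limit)
    {
        const string Head = "<root>";
        const string Tail = "</root>";
        var utf8 = new UTF8Encoding(false); // UTF-8 without BOM
        long tailBytes = utf8.GetByteCount(Tail);
        var paths = new List<string>();
        StreamWriter output = null;
        long size = 0;  // bytes written to the current file so far
        int count = 0;  // elements written to the current file so far

        try
        {
            foreach (string value in values)
            {
                string fragment = ToFragment(value);
                long needed = utf8.GetByteCount(fragment);

                // close the current file if this element would not fit
                if (output != null && count > 0 &&
                    size + needed + tailBytes > limit)
                {
                    output.Write(Tail);
                    output.Dispose();
                    output = null;
                }

                // open the next file on demand
                if (output == null)
                {
                    string path = prefix + "." + paths.Count + ".xml";
                    output = new StreamWriter(path, false, utf8);
                    paths.Add(path);
                    output.Write(Head);
                    size = utf8.GetByteCount(Head);
                    count = 0;
                }

                output.Write(fragment);
                size += needed;
                count++;
            }
            if (output != null)
            {
                output.Write(Tail);
            }
        }
        finally
        {
            if (output != null)
            {
                output.Dispose();
            }
        }
        return paths;
    }
}
```

A call such as `XmlSplitter.WriteSplit(values, "some.data", 50L * 1024 * 1024)` would then produce `some.data.0.xml`, `some.data.1.xml`, and so on. The sketch writes the container tags as raw text to keep the byte accounting simple; a real document with attributes, namespaces or deeper nesting would need the closing "tail" computed per container.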

Hope this helps!


EDIT

Although the approach above is intuitive and easier to implement, I felt it was worth investigating the optimization mentioned above. This is what I got.

An extension method that helps write ancestor nodes (i.e. container nodes, and all other kinds of markup):

// performs a shallow copy of a given node. courtesy of Mark Fussell
// http://blogs.msdn.com/b/mfussell/archive/2005/02/12/371546.aspx
public static void WriteShallowNode(this XmlWriter writer, XmlReader reader)
{

    switch (reader.NodeType)
    {
        case XmlNodeType.Element:
            writer.WriteStartElement(
                reader.Prefix, 
                reader.LocalName, 
                reader.NamespaceURI);
            writer.WriteAttributes(reader, true);
            if (reader.IsEmptyElement)
            {
                writer.WriteEndElement();
            }
            break;
        case XmlNodeType.Text: writer.WriteString(reader.Value); break;
        case XmlNodeType.Whitespace:
        case XmlNodeType.SignificantWhitespace:
            writer.WriteWhitespace(reader.Value);
            break;
        case XmlNodeType.CDATA: writer.WriteCData(reader.Value); break;
        case XmlNodeType.EntityReference: 
            writer.WriteEntityRef(reader.Name); 
            break;
        case XmlNodeType.XmlDeclaration:
        case XmlNodeType.ProcessingInstruction:
            writer.WriteProcessingInstruction(reader.Name, reader.Value);
            break;
        case XmlNodeType.DocumentType:
            writer.WriteDocType(
                reader.Name, 
                reader.GetAttribute("PUBLIC"), 
                reader.GetAttribute("SYSTEM"), 
                reader.Value);
            break;
        case XmlNodeType.Comment: writer.WriteComment(reader.Value); break;
        case XmlNodeType.EndElement: writer.WriteFullEndElement(); break;
    }
}

And a method that performs the trimming (not an extension method, since extending any of the parameter types would be a bit ambiguous):

// trims xml file to specified file size. does so by 
// counting number of "victim candidates" and then iteratively
// trimming these candidates one at a time until resultant
// file size is just less than desired limit. does not
// consider nested victim candidates.
public static void TrimXmlFile(string filename, long size, string trimNodeName)
{
    long fileSize = new FileInfo(filename).Length;
    long workNodeCount = 0;

    // count number of victim elements in xml
    if (fileSize > size)
    {
        XmlReader countReader = XmlReader.Create(filename);
        for (; countReader.Read(); )
        {
            if (countReader.NodeType == XmlNodeType.Element && 
                countReader.Name == trimNodeName)
            {
                workNodeCount++;
                countReader.Skip();
            }
        }
        countReader.Close();
    }

    // if greater than desired file size, and there is at least
    // one victim candidate
    string workFilename = filename+".work";
    for (; 
        fileSize > size && workNodeCount > 0; 
        fileSize = new FileInfo(filename).Length)
    {
        workNodeCount--;
        using (FileStream readFile = new FileStream(filename, FileMode.Open))
        using (FileStream writeFile = new FileStream(
            workFilename, 
            FileMode.Create))
        {
            XmlReader reader = XmlReader.Create(readFile);
            XmlWriter writer = XmlWriter.Create(writeFile);

            long j = 0;
            bool hasAlreadyRead = false;
            for (; (hasAlreadyRead) || reader.Read(); )
            {

                // if node is a victim node
                if (reader.NodeType == XmlNodeType.Element && 
                    reader.Name == trimNodeName)
                {
                    // if we have not surpassed this iteration's
                    // allowance, preserve node
                    if (j < workNodeCount)
                    {
                        writer.WriteNode(reader, true);
                    }
                    j++;

                    // if we have exceeded this iteration's
                    // allowance, trim node (and whitespace)
                    if (j >= workNodeCount)
                    {
                        reader.ReadToNextSibling(trimNodeName);
                    }
                    hasAlreadyRead = true;
                }
                else
                {
                    // some other xml content we should preserve
                    writer.WriteShallowNode(reader);
                    hasAlreadyRead = false;
                }
            }
            writer.Flush();
        }
        File.Copy(workFilename, filename, true);
    }
    File.Delete(workFilename);
}

If your XML contains whitespace formatting, any whitespace between the last remaining victim node and the closing tag of the container element is lost. This can be mitigated by altering the skip clause (moving the j++ statement after the skip), but then you end up with additional whitespace. The solution presented above generates a replica of the source file with minimal file size.

johnny g answered Oct 22 '22 10:10


You can write a big XML file with XmlWriter or XDocument without any problem.

Here is an example. It generates a 63MB XML file in less than 5 seconds, using the XmlWriter class.

using (XmlWriter writer = XmlWriter.Create("YourFilePath"))
{
    writer.WriteStartDocument();

    writer.WriteStartElement("Root");

    for (int i = 0; i < 1000000; i++) //Write one million nodes.
    {
        writer.WriteStartElement("Root");
        writer.WriteAttributeString("value", "Value #" + i.ToString());
        writer.WriteString("Inner Text #" + i.ToString());
        writer.WriteEndElement();
    }
    writer.WriteEndElement();

    writer.WriteEndDocument();
}
Francis B. answered Oct 22 '22 11:10