
Can an XmlSerializer pool strings to avoid large duplicate strings?

I've got some very large XML files which I read using a System.Xml.Serialization.XmlSerializer. It's pretty fast (well, fast enough), but I want it to pool strings, as some long strings occur very many times.

The XML looks somewhat like this:

<Report>
    <Row>
        <Column name="A long column name!">hey</Column>
        <Column name="Another long column name!!">7</Column>
        <Column name="A third freaking long column name!!!">hax</Column>
        <Column name="Holy cow, can column names really be this long!?">true</Column>
    </Row>
    <Row>
        <Column name="A long column name!">yo</Column>
        <Column name="Another long column name!!">53</Column>
        <Column name="A third freaking long column name!!!">omg</Column>
        <Column name="Holy cow, can column names really be this long!?">true</Column>
    </Row>
    <!-- ... ~200k more rows go here... -->
</Report>

And the classes the XML is deserialized into look somewhat like this:

class Report 
{
    public Row[] Rows { get; set; }
}
class Row 
{
    public Column[] Columns { get; set; }
}
class Column 
{
    public string Name { get; set; }
    public string Value { get; set; }
}

When the data is imported, a new string is allocated for every column name. I can see why that is so, but according to my calculations, it means a handful of duplicated strings make up ~50% of the memory used by the imported data. I'd consider it a very good trade-off to spend some extra CPU cycles to cut memory consumption in half. Is there some way to have the XmlSerializer pool strings, so that duplicates are discarded and can be reclaimed the next time a gen0 GC occurs?


Also, some final notes:

  • I can't change the XML schema. It's an exported file from a third-party vendor.

  • I know I could (theoretically) make a faster parser using an XmlReader instead. It would not only allow me to do my own string pooling, but also to process data mid-import, so that not all 200k rows have to be kept in RAM until I've read the entire file. Still, I'd rather not spend the time writing and debugging a custom parser. The real XML is a bit more complicated than the example, so it's quite a non-trivial task. And as mentioned above, the XmlSerializer really does perform well enough for my purposes; I'm just wondering if there is an easy way to tweak it a little.

  • I could write a string pool of my own and use it in the Column.Name setter, but I'd rather not, as (1) that means fiddling with auto-generated code, and (2) it invites a slew of problems related to concurrency and memory leaks.

  • And no, by "pooling", I don't mean "interning" as that can cause memory leaks.

gustafc asked Oct 14 '22 16:10


2 Answers

Personally, I wouldn't hesitate to hand-crank the entities - either by assuming ownership of the generated code, or doing it manually (and getting rid of the arrays ;-p).

Re concurrency - you could perhaps have a thread-static pool? AFAIK, XmlSerializer just uses the one thread, so this should be fine. It would also allow you to throw the pool away when you're done. So then you could have something like a static pool, but per thread. Then perhaps tweak the setters:

class Column 
{
    private string name, value;
    public string Name {
        get { return this.name; }
        set { this.name = MyPool.Get(value); }
    }
    public string Value {
        get { return this.value; }
        set { this.value = MyPool.Get(value); }
    }
}

where the static MyPool.Get method talks to a static field (HashSet<string>, presumably) decorated with [ThreadStatic].
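A minimal sketch of what such a pool might look like. Only the name MyPool.Get comes from the answer; the Reset method and the internals are assumptions. It uses a Dictionary<string, string> rather than a HashSet<string>, because handing back the stored instance from a HashSet needs HashSet<string>.TryGetValue, which only exists in newer framework versions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: the answer names MyPool.Get only; the rest is assumed.
static class MyPool
{
    // One pool per thread. A [ThreadStatic] field initializer would run only on
    // the first thread that touches the class, so the pool is created lazily.
    [ThreadStatic]
    private static Dictionary<string, string> pool;

    // Returns the first instance seen (on this thread) that equals 'value'.
    public static string Get(string value)
    {
        if (value == null)
            return null;

        if (pool == null)
            pool = new Dictionary<string, string>(StringComparer.Ordinal);

        string existing;
        if (pool.TryGetValue(value, out existing))
            return existing;      // duplicate: the caller's copy becomes garbage

        pool.Add(value, value);   // first occurrence: remember this instance
        return value;
    }

    // Assumed addition: drop the current thread's pool once the import is done,
    // so the pooled strings do not outlive the imported data.
    public static void Reset()
    {
        pool = null;
    }
}
```

Calling MyPool.Reset() on the importing thread after Deserialize returns would release the pool itself; the strings still referenced by the deserialized objects stay alive as usual.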

Marc Gravell answered Oct 20 '22 15:10


I know it's an old thread, but I found a nice way to do it:

Create an XmlReader subclass that overrides the Value property so that, before a value is returned, it is looked up in your string pool and the pooled instance is returned instead.

The Value property of XmlReader, from MSDN:

The value returned depends on the NodeType of the node. The following table lists node types that have a value to return. All other node types return String.Empty.

For example, for a node of type Attribute, it returns the value of the attribute.

Hence the implementation will look like this:

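using System.Collections.Generic;
using System.IO;
using System.Xml;

// XmlTextReader that returns a shared instance for repeated attribute values.
public class StringPoolXmlTextReader : XmlTextReader
{
    // Maps each distinct string to the first instance seen; lives as long as the reader.
    private readonly Dictionary<string, string> stringPool = new Dictionary<string, string>();

    internal StringPoolXmlTextReader(Stream stream)
        : base(stream)
    {
    }

    public override string Value
    {
        get
        {
            // Pool attribute values only (the column names in the question's XML).
            if (this.NodeType == XmlNodeType.Attribute)
                return GetOrAddFromPool(base.Value);

            return base.Value;
        }
    }

    private string GetOrAddFromPool(string str)
    {
        if (str == null)
            return null;

        // First occurrence: remember this instance; later duplicates return it.
        if (!stringPool.TryGetValue(str, out var res))
        {
            res = str;
            stringPool.Add(str, str);
        }

        return res;
    }
}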

How to use:

using (var stream = File.Open(@"..\..\Report.xml", FileMode.Open))
{
   var reader = new StringPoolXmlTextReader(stream);
   var ser = new XmlSerializer(typeof(Report));
   var data = (Report)ser.Deserialize(reader);
}
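To check that the pooling actually takes effect, something like the following self-contained sketch could be used. The mapping attributes ([XmlElement], [XmlAttribute("name")], [XmlText]) are assumptions, since the question only shows the classes "somewhat like" the real ones, and PoolingReader and PoolDemo are hypothetical names: PoolingReader is a compact copy of the reader above that takes a TextReader so the XML can be supplied inline.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Assumed mapping for the question's XML; the real generated classes may differ.
[XmlRoot("Report")]
public class Report
{
    [XmlElement("Row")]
    public Row[] Rows { get; set; }
}

public class Row
{
    [XmlElement("Column")]
    public Column[] Columns { get; set; }
}

public class Column
{
    [XmlAttribute("name")]
    public string Name { get; set; }

    [XmlText]
    public string Value { get; set; }
}

// Compact variant of the pooling reader, reading from a TextReader for the demo.
public class PoolingReader : XmlTextReader
{
    private readonly Dictionary<string, string> pool = new Dictionary<string, string>();

    public PoolingReader(TextReader input) : base(input) { }

    public override string Value
    {
        get
        {
            string raw = base.Value;
            if (NodeType != XmlNodeType.Attribute || raw == null)
                return raw;

            string pooled;
            if (!pool.TryGetValue(raw, out pooled))
            {
                pool.Add(raw, raw);
                pooled = raw;
            }
            return pooled;
        }
    }
}

public static class PoolDemo
{
    // Deserializes two rows with the same column name and reports whether
    // both Name properties refer to the same string instance.
    public static bool NamesShareInstance()
    {
        const string xml =
            "<Report>" +
            "<Row><Column name=\"A long column name!\">hey</Column></Row>" +
            "<Row><Column name=\"A long column name!\">yo</Column></Row>" +
            "</Report>";

        var serializer = new XmlSerializer(typeof(Report));
        using (var reader = new PoolingReader(new StringReader(xml)))
        {
            var report = (Report)serializer.Deserialize(reader);
            return ReferenceEquals(report.Rows[0].Columns[0].Name,
                                   report.Rows[1].Columns[0].Name);
        }
    }
}
```

The check relies on the serializer reading attribute values through the reader's Value property, which is what the approach in this answer depends on as well.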

Performance: I tested with 200K rows with random column values. Deserialization time was unchanged, and the memory used by the Report went down from 78,551,460 bytes to 48,890,016 bytes (a decrease of ~38%).

Notes:

  1. The example inherits from XmlTextReader, but you can inherit from any XmlReader implementation.
  2. You can also pool the column values by overriding the Value property like this: public override string Value => GetOrAddFromPool(base.Value); but this can increase deserialization time by about 20% when the values are not duplicated (as in my test, where they are random).
itaiy answered Oct 20 '22 16:10