
Standard xml parser has very low performance in Golang

I have a 100 GB XML file and parse it with a SAX-style streaming approach in Go using this code:

file, err := os.Open(filename)
handle(err)
defer file.Close()
buffer := bufio.NewReaderSize(file, 1024*1024*256) // 256 MiB read buffer
decoder := xml.NewDecoder(buffer)
for {
    t, err := decoder.Token()
    if err != nil {
        break // io.EOF at end of input, or a parse error
    }
    switch se := t.(type) {
    case xml.StartElement:
        if se.Name.Local == "House" {
            house := House{}
            err := decoder.DecodeElement(&house, &se)
            handle(err)
        }
    }
}

But the Go program runs very slowly, judging by execution time and disk usage. My HDD can read data at around 100-120 MB/s, but Go reads at only 10-13 MB/s. As an experiment I rewrote this code in C#:

using (XmlReader reader = XmlReader.Create(filename))
{
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                if (reader.Name == "House")
                {
                    //Code
                }
                break;
        }
    }
}

With C# the HDD is fully loaded: it reads data at 100-110 MB/s, and the execution time is around 10 times lower.

How can I improve XML parsing performance in Go?

Grey asked Sep 09 '17 21:09

1 Answer

These 5 things can help increase speed when using the encoding/xml library:
(Tested against an XML file with 75k entries, 20 MB; percentages are relative to the previous bullet)

  1. Use well-defined structures
  2. Implement xml.Unmarshaler on all your structures
    • Lots of code
    • Saves 20% time and 15% allocs
  3. Replace d.DecodeElement(&foo, &token) with foo.UnmarshalXML(d, &token)
    • Almost 100% safe
    • Saves 10% time & allocs
  4. Use d.RawToken() instead of d.Token()
    • Needs manual handling of nested objects and namespaces
    • Saves 10% time & 20% allocs
  5. If you use d.Skip(), reimplement it using d.RawToken()
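As a rough sketch of points 2 and 3, here is what a hand-written xml.Unmarshaler could look like. The House fields (Id attribute, Name child element) and the parseHouse helper are hypothetical, for illustration only; adapt them to your real schema:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// House is a hypothetical structure for illustration; replace the
// fields with whatever your real elements contain.
type House struct {
	Id   string
	Name string
}

// UnmarshalXML implements xml.Unmarshaler by hand, which avoids the
// reflection-driven path DecodeElement takes for plain structs.
func (h *House) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	for _, attr := range start.Attr {
		if attr.Name.Local == "Id" {
			h.Id = attr.Value
		}
	}
	for {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "Name" {
				// Unmarshaling into a string collects the element's chardata.
				if err := d.DecodeElement(&h.Name, &t); err != nil {
					return err
				}
			} else if err := d.Skip(); err != nil { // skip unknown children
				return err
			}
		case xml.EndElement:
			if t.Name == start.Name { // closing </House>: we are done
				return nil
			}
		}
	}
}

// parseHouse scans src for the first <House> element and decodes it by
// calling UnmarshalXML directly instead of decoder.DecodeElement.
func parseHouse(src string) (House, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	var h House
	for {
		tok, err := d.Token()
		if err != nil {
			return h, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "House" {
			return h, h.UnmarshalXML(d, se)
		}
	}
}

func main() {
	h, err := parseHouse(`<House Id="42"><Name>Baker</Name></House>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Id, h.Name) // 42 Baker
}
```

The same loop can later be migrated from d.Token() to d.RawToken() (point 4), at the cost of tracking nesting depth and namespaces yourself.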

I reduced time and allocs by 40% on my specific use case at the cost of more code, boilerplate, and potentially worse handling of corner cases (my inputs are fairly consistent), but it's still not enough.

benchstat first.bench.txt parseraw.bench.txt 
name          old time/op    new time/op    delta
Unmarshal-16     1.06s ± 6%     0.66s ± 4%  -37.55%  (p=0.008 n=5+5)

name          old alloc/op   new alloc/op   delta
Unmarshal-16     461MB ± 0%     280MB ± 0%  -39.20%  (p=0.029 n=4+4)

name          old allocs/op  new allocs/op  delta
Unmarshal-16     8.42M ± 0%     5.03M ± 0%  -40.26%  (p=0.016 n=4+5)

In my experiments, the lack of memoization is the reason for the XML parser's large time and allocation counts, which slow it down significantly, mostly because Go copies by value.

Goodwine answered Oct 13 '22 00:10