
Standard xml parser has very low performance in Golang

I have a 100 GB XML file and parse it with a SAX-style streaming approach in Go using this code:

file, err := os.Open(filename)
handle(err)
defer file.Close()
buffer := bufio.NewReaderSize(file, 1024*1024*256) // 256 MiB read buffer
decoder := xml.NewDecoder(buffer)
for {
    t, err := decoder.Token()
    if err != nil {
        break // io.EOF at end of input, or a parse error
    }
    switch se := t.(type) {
    case xml.StartElement:
        if se.Name.Local == "House" {
            house := House{}
            err := decoder.DecodeElement(&house, &se)
            handle(err)
        }
    }
}

But the Go program runs very slowly, judging by execution time and disk usage. My HDD can read data at around 100-120 MB/s, but Go reads at only 10-13 MB/s. As an experiment I rewrote this code in C#:

using (XmlReader reader = XmlReader.Create(filename))
{
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                if (reader.Name == "House")
                {
                    //Code
                }
                break;
        }
    }
}

With C# the HDD is fully loaded: it reads data at 100-110 MB/s, and the execution time is around 10 times lower.

How can I improve XML parsing performance in Go?

Grey asked Sep 09 '17 21:09

1 Answer

These 5 things can help increase speed when using the encoding/xml library:
(Tested against an XML file with 75k entries, 20 MB; percentages are relative to the previous bullet)

  1. Use well-defined structures
  2. Implement xml.Unmarshaler on all your structures
    • Lots of code
    • Saves 20% time and 15% allocs
  3. Replace d.DecodeElement(&foo, &token) with foo.UnmarshalXML(d, &token)
    • Almost 100% safe
    • Saves 10% time & allocs
  4. Use d.RawToken() instead of d.Token()
    • Needs manual handling of nested objects and namespaces
    • Saves 10% time & 20% allocs
  5. If you use d.Skip(), reimplement it using d.RawToken()
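As a rough sketch of points 2 and 3, here is what a hand-written xml.Unmarshaler could look like. The House fields (Id attribute, Name child element) and the parseHouse helper are hypothetical, for illustration only; adapt them to your real schema:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// House is a hypothetical structure for illustration; replace the
// fields with whatever your real elements contain.
type House struct {
	Id   string
	Name string
}

// UnmarshalXML implements xml.Unmarshaler by hand, which avoids the
// reflection-driven path DecodeElement takes for plain structs.
func (h *House) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	for _, attr := range start.Attr {
		if attr.Name.Local == "Id" {
			h.Id = attr.Value
		}
	}
	for {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "Name" {
				// Unmarshaling into a string collects the element's chardata.
				if err := d.DecodeElement(&h.Name, &t); err != nil {
					return err
				}
			} else if err := d.Skip(); err != nil { // skip unknown children
				return err
			}
		case xml.EndElement:
			if t.Name == start.Name { // closing </House>: we are done
				return nil
			}
		}
	}
}

// parseHouse scans src for the first <House> element and decodes it by
// calling UnmarshalXML directly instead of decoder.DecodeElement.
func parseHouse(src string) (House, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	var h House
	for {
		tok, err := d.Token()
		if err != nil {
			return h, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "House" {
			return h, h.UnmarshalXML(d, se)
		}
	}
}

func main() {
	h, err := parseHouse(`<House Id="42"><Name>Baker</Name></House>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Id, h.Name) // 42 Baker
}
```

The same loop can later be migrated from d.Token() to d.RawToken() (point 4), at the cost of tracking nesting depth and namespaces yourself.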

I reduced time and allocs by 40% on my specific use case at the cost of more code, boilerplate, and potentially worse handling of corner cases (my inputs are fairly consistent), but it's still not enough.

benchstat first.bench.txt parseraw.bench.txt 
name          old time/op    new time/op    delta
Unmarshal-16     1.06s ± 6%     0.66s ± 4%  -37.55%  (p=0.008 n=5+5)

name          old alloc/op   new alloc/op   delta
Unmarshal-16     461MB ± 0%     280MB ± 0%  -39.20%  (p=0.029 n=4+4)

name          old allocs/op  new allocs/op  delta
Unmarshal-16     8.42M ± 0%     5.03M ± 0%  -40.26%  (p=0.016 n=4+5)

In my experiments, the lack of memoization is the reason for the XML parser's large time and allocation counts, which slow it down significantly, mostly because Go copies by value.

Goodwine answered Oct 13 '22 00:10