
Sorting BIG Data XML file

Tags: java, xml, bigdata

I have an XML file with a compressed size of about 100 GB (about 1 TB uncompressed). It contains roughly 100 million entries of the following form:

<root>
  <entry>
    <id>1234</id>
     ...
  </entry>
  <entry>
    <id>1230</id>
    ...
  </entry>
</root>

I would like to sort this file by id. What would be a good way to do so?

By the way, I can use a machine with 16 cores and 128 GB RAM.

Asked by David on Apr 27 '26

1 Answer

At this stage it's useful to remember the techniques that people used to sort magnetic tapes or decks of punched cards, in the days when the data was much larger than available direct access memory. (I once watched a team of operators sort a quarter of a million cards - about 120 trays). You basically need a combination of streaming, merging, and splitting, which are all operations available in principle using XSLT 3.0. There are two processors available, Saxon-EE and Exselt, and neither is yet a 100% complete implementation, so you'll be constrained by the limitations of the products more than the spec.

My instinct would be to go for a digit-by-digit (radix) sort. You don't say how long the ids used as sort keys are. "Digits" here of course doesn't have to mean decimal digits, but assuming decimal for simplicity, the basic idea is this: first split the file into 10 buckets based on the last digit of the sort key; then process the buckets in that order, this time splitting by the penultimate digit; and carry on for as many digits as there are in the key. That is one pass over the complete file for each digit in the sort key.

If the ids are dense then with 100 million keys they are presumably about 8 digits long, which means 8 passes. If we assume a processing speed of 10 Gb/min, which is probably the best you can get from off-the-shelf XML parsers, each pass over a 1 Tb file is going to take about 2 hours, so 8 passes would be around 16 hours. It might be better to use, say, base 100, so you split into 100 files on each pass; then you only need 4 passes.
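To make the pass structure concrete, here is a rough Java sketch of that digit-by-digit partitioning, using base 100 (two decimal digits per pass, so 4 passes for 8-digit keys). This is only an illustration under my own assumptions, not the XSLT approach shown below: the file names are invented, and each entry is assumed to have already been flattened to a single "id<TAB>record" line so that XML parsing stays out of the picture.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch of least-significant-digit radix partitioning over temp files.
// Assumes each entry has been reduced to a line "<id>\t<serialized entry>".
public class RadixPassSketch {
    static final int BASE = 100;    // base 100: two decimal digits per pass
    static final int PASSES = 4;    // ~8 decimal digits for ~100M dense ids

    public static void main(String[] args) throws IOException {
        List<Path> inputs = List.of(Path.of("records.tsv"));  // hypothetical input file
        long divisor = 1;                                     // selects the digit for this pass
        for (int pass = 0; pass < PASSES; pass++) {
            // One bucket file per possible digit value in this pass.
            BufferedWriter[] buckets = new BufferedWriter[BASE];
            for (int b = 0; b < BASE; b++) {
                buckets[b] = Files.newBufferedWriter(Path.of("pass" + pass + "_bucket" + b + ".tsv"));
            }
            // Read the previous pass's files in bucket order, so each pass is stable.
            for (Path in : inputs) {
                try (BufferedReader r = Files.newBufferedReader(in)) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        long id = Long.parseLong(line.substring(0, line.indexOf('\t')));
                        int digit = (int) ((id / divisor) % BASE);
                        buckets[digit].write(line);
                        buckets[digit].newLine();
                    }
                }
            }
            for (BufferedWriter w : buckets) w.close();
            // The buckets just written become the inputs to the next pass.
            List<Path> next = new ArrayList<>();
            for (int b = 0; b < BASE; b++) next.add(Path.of("pass" + pass + "_bucket" + b + ".tsv"));
            inputs = next;
            divisor *= BASE;
        }
    }
}

After the last pass, reading the buckets back in digit order gives the entries sorted by id; it works because each pass is stable, i.e. items within a bucket keep the relative order the previous pass gave them.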

The essential XSLT 3.0 code is:

<xsl:stream href="in.xml">
  <xsl:fork>
    <!-- $digit holds the (1-based) position of the sort-key digit for this pass -->
    <xsl:for-each-group select="root/entry"
         group-by="substring(id, $digit, 1)">
      <xsl:result-document href="temp{current-grouping-key()}">
        <xsl:sequence select="current-group()"/>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:fork>
</xsl:stream>

Now the bad news: in Saxon-EE 9.7 this code probably isn't sufficiently optimised. Although in principle the items in each group could be streamed directly to the relevant serialised result-document, Saxon doesn't yet treat this case specially and will build each group in memory before processing it. I don't know if Exselt can do any better.

So is there an alternative? Well, perhaps we could try something like this:

  1. Split the file into N files: that is, put the first X/N items into file 1, the next X/N into file 2, and so on.
  2. Sort each file, conventionally in memory.
  3. Do a streamed merge of the resulting files, using xsl:merge.

I think that would work in Saxon. The first step can be done using <xsl:for-each-group group-adjacent="(position()-1) idiv $N"> which is fully streamed in Saxon.
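For a concrete picture of what the final merge pass does (whether you implement it with xsl:merge or outside XSLT), here is a minimal Java sketch of a k-way streamed merge. Again the file names are invented, and each sorted chunk is assumed to have been flattened to "id<TAB>record" lines so the XML details stay out of the way.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch of a k-way streamed merge over pre-sorted chunk files - the same
// idea xsl:merge implements. Each chunk line is assumed to be "<id>\t<record>".
public class MergeSketch {
    // One open cursor per sorted chunk file.
    record Cursor(long id, String line, BufferedReader reader) {}

    public static void main(String[] args) throws IOException {
        int chunkCount = 100;   // e.g. 100 chunks of ~10 Gb each
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparingLong(Cursor::id));

        for (int i = 0; i < chunkCount; i++) {
            BufferedReader r = Files.newBufferedReader(Path.of("chunk" + i + ".sorted.tsv"));
            advance(r, heap);
        }

        try (BufferedWriter out = Files.newBufferedWriter(Path.of("sorted.tsv"))) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();      // smallest id across all chunks
                out.write(c.line());
                out.newLine();
                advance(c.reader(), heap);   // pull the next line from that chunk
            }
        }
    }

    // Read the next line of a chunk (if any) and push it onto the heap.
    static void advance(BufferedReader r, PriorityQueue<Cursor> heap) throws IOException {
        String line = r.readLine();
        if (line == null) { r.close(); return; }
        long id = Long.parseLong(line.substring(0, line.indexOf('\t')));
        heap.add(new Cursor(id, line, r));
    }
}

xsl:merge does essentially the same thing internally: it keeps one cursor per merge source and repeatedly emits the item with the smallest merge key, so memory usage is proportional to the number of chunks rather than to the data size.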

This is essentially a 3-pass solution, in that each item is parsed and serialized three times. I would go for splitting the 1Tb file into 100 10Gb files. Doing an in-memory XSLT sort of a 10Gb file is pushing it, but you've got some horsepower to play with. You could, however, run into Java addressing limits: arrays and strings have 1G limits, I think.

Answered by Michael Kay on Apr 29 '26


