Writing more efficient xquery code (avoiding redundant iteration)

Question

Here's a simplified version of a problem I'm working on: I have a bunch of xml data that encodes information about people. Each person is uniquely identified by an 'id' attribute, but they may go by many names. For example, in one document, I might find

<person id=1>Paul Mcartney</person>
<person id=2>Ringo Starr</person>

And in another I might find:

<person id=1>Sir Paul McCartney</person>
<person id=2>Richard Starkey</person>

I want to use xquery to produce a new document that lists every name associated with a given id. i.e.:

<person id=1>
    <name>Paul McCartney</name>
    <name>Sir Paul McCartney</name>
    <name>James Paul McCartney</name>
</person>
<person id=2>
    ...
</person>

The way I'm doing this now in xquery is something like this (pseudocode-esque):

let $ids := distinct-terms( [all the id attributes on people] )
for $id in $ids
    return <person id={$id}>
    {
    for $unique-name in distinct-values
            (
            for $name in ( [all names] )
            where $name/@id=$id
            return $name
            )
        return <name>{$unique-name}</name>
    }
    </person>

The problem is that this is really slow. I imagine the bottleneck is the innermost loop, which executes once for every id (of which there are about 1200). I'm dealing with a fair bit of data (300 MB, spread over about 800 xml files), so even a single execution of the query in the inner loop takes about 12 seconds, which means that repeating it 1200 times will take about 4 hours (which might be optimistic - the process has been running for 3 hours so far). Not only is it slow, it's using a whole lot of virtual memory. I'm using Saxon, and I had to set java's maximum heap size to 10 GB (!) to avoid getting out of memory errors, and it's currently using 6 GB of physical memory.

So here's how I'd really like to do this (in Pythonic pseudocode):

persons = {}
for id in ids:
    person[id] = set()
for person in all_the_people_in_my_xml_document:
    persons[person.id].add(person.name)

There, I just did it in linear time, with only one sweep of the xml document. Now, is there some way to do something similar in xquery? Surely if I can imagine it, a reasonable programming language should be able to do it (he said quixotically). The problem, I suppose, is that unlike Python, xquery doesn't (as far as I know) have anything like an associative array.

Is there some clever way around this? Failing that, is there something better than xquery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.

Oliver Hallam · Accepted Answer

This unfortunately is a shortcoming in XQuery 1.0

XQuery 1.1 adds the group by clause to the syntax to resolve this problem, and your problem would be resolved with:

for $person in /person
let $id = $person/@id
group by $id
return  <people id="{$id}">{
          for $name in distinct-values($person)
          return <name>{$name}</name>
        }</people>

Unfortunately XQuery 1.1 is not widely implemented, so for the moment you are stuck without the group by clause.

As a developer on XQSharp I cannot speak for any other implementations, but we have spent a lot of time tweaking our optimizer to spot common group-by patterns in XQuery 1.1 and perform them with the algorithm you have specified.

In particular, the following version of your query:

declare variable $people as element(person, xs:untyped)* external;

for $id in distinct-values($people/@id)
return <people id="{$id}">{
          for $person in $people
          where $person/@id = $id
          return <name>{$person}</name>
       }</people>

is spotted as a group-by, as is evidenced by the following query plan:

library http://www.w3.org/2005/xpath-functions external;
library http://www.w3.org/2001/XMLSchema external;
declare variable $people external;

for $distinct-person in $people
let $id := http://www.w3.org/2005/xpath-functions:data($distinct-person/attribute::id)
group by
  $id
aggregate
  element {name} { fs:item-sequence-to-node-sequence($distinct-person) }
as
  $:temp:19
return
  element {person} { (attribute {id} { $id } , fs:item-sequence-to-node-sequence($:temp:19)) }

Note that the type annotation as element(person, xs:untyped)* is required, as without knowing that the nodes are untyped (not validated against a schema), the query processor has no way of knowing that $person/@id doesn't have multiple items in its data value. XQSharp does not yet support group by expressions where each node can have more than one key. However in this case a left outer join is still spotted, and so the complexity should be roughly n log n and not quadratic as you are experiencing.

Unfortunately though adding in the distinct-values around the set of people in the group (to filter out duplicate names) seems to stop XQSharp from finding the join; this has been filed as a bug. For now this could be solved by doing the query in two passes - grouping the names by id, and removing duplicate names.

In summary, there is not a better approach in XQuery 1.0, but some implementations (eg. XQSharp) will be able to evaluate this efficiently. If in doubt, check the query plan.

For a more detailed look at the join optimizations performed by XQSharp, take a look at this blog post.

Writing more efficient xquery code (avoiding redundant iteration)

Tags:

xml

xslt

xquery

Coquelicot

1 Answers

Oliver Hallam

Recent Activity

Donate For Us

Writing more efficient xquery code (avoiding redundant iteration)

Tags:

xml

xslt

xquery

Coquelicot

1 Answers

Oliver Hallam

Related questions

Recent Activity

Donate For Us