Here's a simplified version of a problem I'm working on: I have a bunch of xml data that encodes information about people. Each person is uniquely identified by an 'id' attribute, but they may go by many names. For example, in one document, I might find
<person id=1>Paul Mcartney</person>
<person id=2>Ringo Starr</person>
And in another I might find:
<person id=1>Sir Paul McCartney</person>
<person id=2>Richard Starkey</person>
I want to use xquery to produce a new document that lists every name associated with a given id. i.e.:
<person id=1>
<name>Paul McCartney</name>
<name>Sir Paul McCartney</name>
<name>James Paul McCartney</name>
</person>
<person id=2>
...
</person>
The way I'm doing this now in xquery is something like this (pseudocode-esque):
let $ids := distinct-terms( [all the id attributes on people] )
for $id in $ids
return <person id={$id}>
{
for $unique-name in distinct-values
(
for $name in ( [all names] )
where $name/@id=$id
return $name
)
return <name>{$unique-name}</name>
}
</person>
The problem is that this is really slow. I imagine the bottleneck is the innermost loop, which executes once for every id (of which there are about 1200). I'm dealing with a fair bit of data (300 MB, spread over about 800 xml files), so even a single execution of the query in the inner loop takes about 12 seconds, which means that repeating it 1200 times will take about 4 hours (which might be optimistic - the process has been running for 3 hours so far). Not only is it slow, it's using a whole lot of virtual memory. I'm using Saxon, and I had to set java's maximum heap size to 10 GB (!) to avoid getting out of memory errors, and it's currently using 6 GB of physical memory.
So here's how I'd really like to do this (in Pythonic pseudocode):
persons = {}
for id in ids:
person[id] = set()
for person in all_the_people_in_my_xml_document:
persons[person.id].add(person.name)
There, I just did it in linear time, with only one sweep of the xml document. Now, is there some way to do something similar in xquery? Surely if I can imagine it, a reasonable programming language should be able to do it (he said quixotically). The problem, I suppose, is that unlike Python, xquery doesn't (as far as I know) have anything like an associative array.
Is there some clever way around this? Failing that, is there something better than xquery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.
This unfortunately is a shortcoming in XQuery 1.0
XQuery 1.1 adds the group by clause to the syntax to resolve this problem, and your problem would be resolved with:
for $person in /person
let $id = $person/@id
group by $id
return <people id="{$id}">{
for $name in distinct-values($person)
return <name>{$name}</name>
}</people>
Unfortunately XQuery 1.1 is not widely implemented, so for the moment you are stuck without the group by clause.
As a developer on XQSharp I cannot speak for any other implementations, but we have spent a lot of time tweaking our optimizer to spot common group-by patterns in XQuery 1.1 and perform them with the algorithm you have specified.
In particular, the following version of your query:
declare variable $people as element(person, xs:untyped)* external;
for $id in distinct-values($people/@id)
return <people id="{$id}">{
for $person in $people
where $person/@id = $id
return <name>{$person}</name>
}</people>
is spotted as a group-by, as is evidenced by the following query plan:
library http://www.w3.org/2005/xpath-functions external;
library http://www.w3.org/2001/XMLSchema external;
declare variable $people external;
for $distinct-person in $people
let $id := http://www.w3.org/2005/xpath-functions:data($distinct-person/attribute::id)
group by
$id
aggregate
element {name} { fs:item-sequence-to-node-sequence($distinct-person) }
as
$:temp:19
return
element {person} { (attribute {id} { $id } , fs:item-sequence-to-node-sequence($:temp:19)) }
Note that the type annotation as element(person, xs:untyped)* is required, as without knowing that the nodes are untyped (not validated against a schema), the query processor has no way of knowing that $person/@id doesn't have multiple items in its data value. XQSharp does not yet support group by expressions where each node can have more than one key. However in this case a left outer join is still spotted, and so the complexity should be roughly n log n and not quadratic as you are experiencing.
Unfortunately though adding in the distinct-values around the set of people in the group (to filter out duplicate names) seems to stop XQSharp from finding the join; this has been filed as a bug. For now this could be solved by doing the query in two passes - grouping the names by id, and removing duplicate names.
In summary, there is not a better approach in XQuery 1.0, but some implementations (eg. XQSharp) will be able to evaluate this efficiently. If in doubt, check the query plan.
For a more detailed look at the join optimizations performed by XQSharp, take a look at this blog post.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With