I'm thinking about developing a system to perform highly parallel queries on nested (but tree-like) data. The potential users are data analysts (physicists, specifically), not programmers. For the user interface, I want to use a well-known query language to avoid proliferating new languages.
Most of the data would be structured like this (imagine the following schema for billions of event structures):
event: struct
  |
  +--- timestamp: bigint
  +--- missing energy: float
  +--- tracks: array of struct
  |      |
  |      +--- momentum: float
  |      +--- theta angle: float
  |      +--- hits: array of struct
  |             |
  |             +--- detector id: int
  |             +--- charge: float
  |             +--- time: float
  |             +--- ...
  +--- showers: array of struct
         |
         +--- ...
The database would be read-only, and most of the queries would be things like the three examples shown below. What these queries have in common is that they all resolve to one scalar per event, though they delve into the arrays of structs to do it. They perform "reduce"-like operations (generally fold in Scala, aggregate in Spark, aggregate functions in SQL) across filtered, transformed subsets of those arrays. I could write them in Scala like this:
// missing check for when zero tracks passed the filter!
{event => event.tracks                       // get the list of tracks
            .filter(t => abs(t.theta) < 2.4) // in theta range
            .maxBy(_.hits.size)              // take the one with the most hits
            .momentum                        // return its momentum
}
{event => mean(
    event.tracks                  // get the list of tracks
      .filter(_.momentum > 10)    // in momentum range
      .flatMap(_.hits)            // explode to hits
      .filter(_.time < 10)        // in time range
      .map(_.charge)              // return their charges
  )}                              // ... to the mean function
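(mean is not a Scala standard-library function; I would supply a small helper along these lines, ignoring the empty-sequence case:)

def mean(xs: Seq[Float]): Float = xs.sum / xs.size   // arithmetic mean of the charges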
// again missing check for fewer than two tracks!
{event =>
  val List(one, two) =            // unpack and assign "one" and "two"
    event.tracks                  // get the list of tracks
      .sortBy(_.momentum)         // sort by momentum
      .take(2)                    // take the first two
  // now compute the weighted mean of structs "one" and "two"
  (one.theta*one.momentum + two.theta*two.momentum) /
    (one.momentum + two.momentum)
}
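For reference, the emptiness checks flagged in the comments above could be added with Option; this is only a sketch, assuming Scala 2.13+ and hypothetical case classes Event and Track matching the schema above:

import scala.math.abs

// query 1, returning None when no track passes the theta cut
def query1Safe(event: Event): Option[Float] =
  event.tracks
    .filter(t => abs(t.theta) < 2.4)
    .maxByOption(_.hits.size)
    .map(_.momentum)

// query 3, returning None when there are fewer than two tracks
def query3Safe(event: Event): Option[Float] =
  event.tracks.sortBy(_.momentum).take(2) match {
    case Seq(one, two) =>
      Some((one.theta * one.momentum + two.theta * two.momentum) /
           (one.momentum + two.momentum))
    case _ => None
  }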
Why not just use Scala? My program is implemented in C and will run on GPUs. Whatever Scala I bring to it would be a reimplemented subset: in other words, an invented language. (The same could be said for Haskell, JavaScript, or any other language that makes heavy use of functions as arguments.)
Also, these queries ought to be declarative. If I implement too much of a general-purpose programming language, details like the order of function calls might become relevant.
Why not just use SQL? Queries like the above are the norm, not complex extremes; is it possible to write them in SQL easily enough that they are readable by anyone other than the author?
SQL supports nested arrays of structs, but all the examples I can find of using that substructure are horrendously complicated. One has to explode the table of events into a table of tracks (or double-explode to get hits), and some complex accounting would be needed to unexplode and get back to one scalar per event.
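To make the explode/unexplode bookkeeping concrete, here is roughly what query 2 looks like in Spark's DataFrame API (a sketch only, assuming events is a DataFrame with the schema above and that timestamp uniquely identifies an event):

import org.apache.spark.sql.functions._

val meanChargePerEvent = events
  .withColumn("track", explode(col("tracks")))    // one row per track
  .filter(col("track.momentum") > 10)             // momentum cut
  .withColumn("hit", explode(col("track.hits")))  // one row per hit
  .filter(col("hit.time") < 10)                   // time cut
  .groupBy(col("timestamp"))                      // "unexplode" back to one row per event
  .agg(avg(col("hit.charge")).as("mean_charge"))  // one scalar per event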
I suppose I could use SQL with new functions like MAXIMAL(collection, function) that return a struct from an array, similar to indexing with track[12], but using the user-provided function as an objective function for maximizing, minimizing, finding the top/bottom N, etc. I don't think SQL supports passing functions as arguments, and if I wrote an SQL dialect that does, it would be non-standard.
Is there a widely used dialect of SQL that supports passing functions as arguments?
Or is there another declarative language I should consider?
JSONiq was designed exactly for querying trees, even and especially when data is highly nested and heterogeneous. It is 95% based on a W3C standard.
Rumble is an open-source implementation of JSONiq that works on collections of billions of records. It uses Spark under the hood, but in a way that is completely transparent to (and hidden from) the user.
The three queries would look like this. With Rumble, they can seamlessly run on a laptop on a small amount of data, but also in parallel on potentially billions of objects on a cluster, as long as the underlying parser is streamlined for it. The declarative code is the same.
Query 1:
(
  for $track in root-file("events.root", "Events").tracks
  where abs($track.theta) lt 2.4
  order by size($track.hits) descending
  return $track
)[1].momentum
Query 2:
root-file("events.root", "Events").tracks[$$.momentum gt 10].hits[][$$.time lt 10].charge
Query 3:
let $tracks := (
  for $track in root-file("events.root", "Events").tracks
  order by $track.momentum
  return $track
)[position() le 2]
return
  sum(
    for $t in $tracks
    return $t.theta * $t.momentum
  ) div sum($tracks.momentum)
I posted this in a comment earlier, but moving it here.
I'm with others on the use of a graph database. I'm not familiar with Neo4j queries, but I expect them to be capable. Similarly, SPARQL would work well for this kind of thing.
For the first query, a SPARQL query might look like:
PREFIX : <http://yournamespace.com/accelerator/>

SELECT ?momentum (MAX(?hitcount) AS ?maxhits)
WHERE {
  SELECT ?momentum (COUNT(?hits) AS ?hitcount)
  WHERE {
    ?track :momentum ?momentum .
    ?track :theta ?theta .
    FILTER (?theta > -2.4 && ?theta < 2.4)
    ?track :hits ?hits .
  }
  GROUP BY ?track ?momentum
}
GROUP BY ?momentum
Identifiers have the : prefix on them because they need to be encoded as URIs. But that's an internal detail for moving to RDF (the data format for SPARQL databases).
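For example, one track might be encoded as triples like this (a made-up sketch in Turtle syntax, reusing the namespace from the query above; the identifiers and values are invented):

# one track with two hits, as RDF triples
@prefix : <http://yournamespace.com/accelerator/> .

:track42 :momentum 12.7 ;
         :theta 0.31 ;
         :hits :hit421 , :hit422 .

:hit421 :charge 1.2 ;
        :time 4.5 .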
The above query is doing sub-queries because you're looking to aggregate (on the count), and then aggregate again (with the max of the count). But you can see that it's all handled in an SQL-like way, and does not require post-processing.