I'm looking at using Titan to create a scalable geospatial data store (I'm thinking R-trees). The documentation describes a GeoShape query and says that Titan can handle geo data with Lucene or ElasticSearch. However, it seems like this would be very slow, because traversing nodes in Cassandra is essentially doing join queries in Cassandra, which is a really bad idea. I think I might be misunderstanding the data representation.
I read the Titan Data Model doc, and I still don't quite get it. If all the edges are stored in a Cassandra row, then Titan would still have to "join" on a vertex table. One way to solve this would be to make the column value equal to the edge property data, and then you could neatly package the vertex data and the edge data into the row. However, this breaks down when you want to do queries deeper than 1 node, and we're back to the joining problem again.
So: is Titan emulating join queries in Cassandra? And how performant is it at geo lookups under these conditions?
I think the question conflates edge traversal with geospatial index lookups. These are separate at both the API and implementation levels. The index is not illustrated in the data model pictures.
Let's make this a little bit more specific. Say I run Titan with ES and Cassandra using Murmur3Partitioner or RandomPartitioner. I declare an ES geospatial index over edges called "place", as documented in the Getting Started page. Looking up edges by geospatial queries, such as the "WITHIN" example in the Getting Started docs, first hits ES. ES returns IDs that Titan can use to look up the associated vertex/edge data in Cassandra quickly, without doing an analog to relational joins.
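To make the two-phase flow concrete, here is a toy sketch in Python. It is not Titan's API: plain dicts stand in for ES and Cassandra, and every name in it (GEO_INDEX, EDGE_STORE, within, fetch_edges) as well as the sample data is made up for illustration. The haversine check is only a stand-in for whatever ES does internally.

```python
# Toy model only: plain dicts stand in for Elasticsearch and Cassandra.
from math import radians, sin, cos, asin, sqrt

# Hypothetical "index backend": edge ID -> (lat, lon) of its "place" property.
GEO_INDEX = {
    "e1": (37.97, 23.72),
    "e2": (38.10, 23.70),
    "e3": (48.85, 2.35),
}

# Hypothetical "storage backend": edge ID -> edge record, fetched by key.
EDGE_STORE = {
    "e1": {"label": "battled", "out": "v1", "in": "v2"},
    "e2": {"label": "battled", "out": "v1", "in": "v3"},
    "e3": {"label": "battled", "out": "v1", "in": "v4"},
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def within(center, radius_km):
    """Phase 1: the index answers the geo predicate and returns IDs only."""
    return sorted(eid for eid, pt in GEO_INDEX.items()
                  if haversine_km(center, pt) <= radius_km)


def fetch_edges(edge_ids):
    """Phase 2: one key lookup per returned ID; no scan, no join."""
    return [EDGE_STORE[eid] for eid in edge_ids]


ids = within((37.97, 23.72), 50.0)
edges = fetch_edges(ids)
```

The point of the sketch is that phase 2 never scans or joins anything: it only does keyed reads for the IDs that phase 1 produced.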
The cost of these edge lookups by geospatial data should be roughly equivalent to the cost of ES's WITHIN implementation (which I think is delegated to Spatial4j), plus the lookups Titan makes on Cassandra after getting IDs, which should be roughly linear in the number of edges found by ES. This is just back-of-the-envelope estimation, so please take it with a big grain of salt.
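Written out as a rough formula, with symbols that are my own shorthand rather than anything from the Titan docs: let k be the number of edges ES matches, T_ES the time ES spends on the WITHIN query, and t_C an assumed average cost of one Cassandra lookup by ID.

```latex
T_{\text{total}} \approx T_{\text{ES}}(\text{WITHIN}) + k \cdot t_{C}
```

This ignores batching, caching, and network effects, so treat it as an order-of-magnitude guide only.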
After I get place edges by geo matching, if I then want to run arbitrary traversals in the neighborhood of each edge in the set, then I would have a look at rooting a MultiQuery on the head/tail vertices and enabling database-level caching. If the query misses cache or cache is cold/disabled, then Titan will still attempt to retrieve all edges the traversal cares about in a single Cassandra slice per vertex, when possible. If you're concerned about Titan's edge traversal efficiency, then you might find Boutique Graph Data with Titan interesting.
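To illustrate why a single slice per vertex is not a join, here is another toy Python sketch. It assumes, along the lines of the Titan Data Model doc, that a vertex's edges live as columns in that vertex's row, and it further assumes those columns are kept sorted by (label, neighbor). The ROWS structure and slice_edges function are hypothetical and are not part of Titan or Cassandra.

```python
from bisect import bisect_left, bisect_right

# Toy model: one "row" per vertex, with its edges as columns
# kept sorted by (label, neighbor). Data is made up.
ROWS = {
    "v1": sorted([
        ("battled", "v2"),
        ("battled", "v3"),
        ("battled", "v4"),
        ("father", "v5"),
        ("mother", "v6"),
    ]),
}


def slice_edges(vertex, label):
    """Return every edge with `label` as one contiguous range of the row."""
    cols = ROWS[vertex]
    lo = bisect_left(cols, (label,))            # first column with this label
    hi = bisect_right(cols, (label, "\uffff"))  # just past the last one
    return cols[lo:hi]


battled = slice_edges("v1", "battled")
```

Because the matching columns are adjacent, one range read on one row returns them all; a deeper traversal repeats that read per vertex reached, which is where batching via MultiQuery and caching would come in.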
HTH