I'm developing an application that could work well with a graph database (Titan), except it's having problems with vertices with many edges, i.e. supernodes.
The supernodes link above points to a blog post from the authors of Titan, explaining a way to resolve the problem. The solution seems to be reducing the number of vertices by filtering on edges.
Unfortunately I want to groupCount
attributes of edges or vertices. For example I have 1 million users and each user belongs to a country. How can I do a fast groupCount
to work out the number of users in each country?
What I've tried so far can be shown in this elaborate groovy script:
g = TitanFactory.open('titan.properties') // Cassandra
r = new Random(100)
people = 1e6
def newKey(g, name, type) {
return g
.makeType()
.name(name)
.simple()
.functional()
.indexed()
.dataType(type)
.makePropertyKey()
}
def newLabel(g, name, key) {
return g
.makeType()
.name(name)
.primaryKey(key)
.makeEdgeLabel()
}
country = newKey(g, 'country', String.class)
newLabel(g, 'lives', country)
g.stopTransaction(SUCCESS)
root = g.addVertex()
countries = ['AU', 'US', 'CN', 'NZ', 'UK', 'PL', 'RU', 'NL', 'FR', 'SP', 'IT']
(1..people).each {
country = countries[(r.nextFloat() * countries.size()).toInteger()]
g.startTransaction()
person = g.addVertex([name: 'John the #' + it])
g.addEdge(g.getVertex(root.id), person, 'lives', [country: country])
g.stopTransaction(SUCCESS)
}
t0 = new Date().time
m = [:]
root = g.getVertex(root.id)
root.outE('lives').country.groupCount(m).iterate()
t1 = new Date().time
println "groupCount seconds: " + ((t1 - t0) / 1000)
Basically one root node (for the sake of Titan not having an "all" nodes lookup), linked to many person
via edges that have the country
property. When I run the groupCount() on 1 million vertices, it takes over a minute.
I realise Titan is probably iterating over each edge and collecting counts, but is there a way to make this run faster in Titan, or any other graph database? Can the index itself be counted so it doesn't have to traverse? Are my indexes correct?
If you make 'country' a primary key for the 'lives' label and then you can retrieve all people for a particular country more quickly. However, in your case you are interested in a group count which requires all edges of that root node to be retrieved in order to iterate over them and bucket the countries.
Hence, this analytical query is much better suited for the graph analytics framework Faunus. It does not require a root vertex as it executes the groupcount by way of a complete database scan and thus avoids the supernode problem. Faunus also uses Gremlin as the query language so you only have to modify your query slightly:
g.V.country.groupCount.cap...
HTH, Matthias
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With