Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"Supernodes" in Titan

I'm developing an application that could work well with a graph database (Titan), except it's having problems with vertices with many edges, i.e. supernodes.

The supernodes link above points to a blog post from the authors of Titan, explaining a way to resolve the problem. The solution seems to be reducing the number of vertices by filtering on edges.

Unfortunately I want to groupCount attributes of edges or vertices. For example I have 1 million users and each user belongs to a country. How can I do a fast groupCount to work out the number of users in each country?

What I've tried so far can be shown in this elaborate groovy script:

g = TitanFactory.open('titan.properties')  // Cassandra
r = new Random(100)
people = 1e6

def newKey(g, name, type) {
    return g
        .makeType()
        .name(name)
        .simple()
        .functional()
        .indexed()
        .dataType(type)
        .makePropertyKey()
}

def newLabel(g, name, key) {
    return g
        .makeType()
        .name(name)
        .primaryKey(key)
        .makeEdgeLabel()
}

country = newKey(g, 'country', String.class)
newLabel(g, 'lives', country)

g.stopTransaction(SUCCESS)

root = g.addVertex()
countries = ['AU', 'US', 'CN', 'NZ', 'UK', 'PL', 'RU', 'NL', 'FR', 'SP', 'IT']

(1..people).each {
    country = countries[(r.nextFloat() * countries.size()).toInteger()]
    g.startTransaction()
    person = g.addVertex([name: 'John the #' + it])
    g.addEdge(g.getVertex(root.id), person, 'lives', [country: country])
    g.stopTransaction(SUCCESS)
}

t0 = new Date().time

m = [:]    
root = g.getVertex(root.id)
root.outE('lives').country.groupCount(m).iterate()

t1 = new Date().time

println "groupCount seconds: " + ((t1 - t0) / 1000)

Basically one root node (for the sake of Titan not having an "all" nodes lookup), linked to many person via edges that have the country property. When I run the groupCount() on 1 million vertices, it takes over a minute.

I realise Titan is probably iterating over each edge and collecting counts, but is there a way to make this run faster in Titan, or any other graph database? Can the index itself be counted so it doesn't have to traverse? Are my indexes correct?

like image 422
gak Avatar asked Oct 30 '12 05:10

gak


1 Answers

If you make 'country' a primary key for the 'lives' label and then you can retrieve all people for a particular country more quickly. However, in your case you are interested in a group count which requires all edges of that root node to be retrieved in order to iterate over them and bucket the countries.

Hence, this analytical query is much better suited for the graph analytics framework Faunus. It does not require a root vertex as it executes the groupcount by way of a complete database scan and thus avoids the supernode problem. Faunus also uses Gremlin as the query language so you only have to modify your query slightly:

g.V.country.groupCount.cap...

HTH, Matthias

like image 131
Matthias Broecheler Avatar answered Sep 27 '22 23:09

Matthias Broecheler