
Neo4j indexing (with Lucene) - good way to organize node "types"?

This is actually more of a Lucene question, but it's in the context of a Neo4j database.

I have a database that's divided into 50 or so node types (the equivalent of "collections" or "tables" in other kinds of databases). Each type has a subset of properties that need to be indexed; some property names are shared across types, some aren't.

When searching, I always want to find nodes of a specific type, never across all nodes.

I can see three ways of organizing this:

  • One index per type, properties map naturally to index fields: index 'foo', 'id'='1234'.

  • A single global index in which each field maps to a property name. To distinguish the type, either include it as part of the value ('id'='foo:1234') or check the nodes once they're returned (I expect duplicates to be very rare).

  • A single index, type is part of the field name: 'foo.id'='1234'.
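To make the three layouts concrete, here is a minimal sketch that models each one as a plain in-memory dictionary. It does not use Lucene or the Neo4j index API; the sample nodes and the function names are hypothetical and only illustrate how the lookup keys differ.

```python
from collections import defaultdict

# Toy records: (node_id, node_type, properties). All values are made up.
NODES = [
    (1, "foo", {"id": "1234"}),
    (2, "bar", {"id": "1234"}),
    (3, "foo", {"id": "5678"}),
]


def build_per_type(nodes):
    """Option 1: one index per type; fields are plain property names."""
    indexes = defaultdict(lambda: defaultdict(set))
    for node_id, node_type, props in nodes:
        for field, value in props.items():
            indexes[node_type][(field, value)].add(node_id)
    return indexes


def build_global_prefixed_value(nodes):
    """Option 2: one index; the type is folded into the value."""
    index = defaultdict(set)
    for node_id, node_type, props in nodes:
        for field, value in props.items():
            index[(field, f"{node_type}:{value}")].add(node_id)
    return index


def build_global_prefixed_field(nodes):
    """Option 3: one index; the type is folded into the field name."""
    index = defaultdict(set)
    for node_id, node_type, props in nodes:
        for field, value in props.items():
            index[(f"{node_type}.{field}", value)].add(node_id)
    return index


per_type = build_per_type(NODES)
by_value = build_global_prefixed_value(NODES)
by_field = build_global_prefixed_field(NODES)

# The same logical lookup ("foo" node with id 1234) in each layout.
hit_option_1 = per_type["foo"][("id", "1234")]
hit_option_2 = by_value[("id", "foo:1234")]
hit_option_3 = by_field[("foo.id", "1234")]
```

All three lookups resolve to the same node; what differs is how many physical indexes exist and how many distinct field names or terms each one carries.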

Once created, the database is read-only.

Are there any benefits to one of those, in terms of convenience, size/cache efficiency, or performance?

As I understand it, with the first option Neo4j will create a separate physical index for each type, which seems suboptimal. With the third, most Lucene documents end up with only a small subset of the fields, and I'm not sure whether that affects anything.

Dmitri asked Oct 09 '22 07:10



2 Answers

I came across this problem recently while building an ActiveRecord connection adapter for Neo4j over REST, to be used in a Rails project. Since both ActiveRecord and ActiveRelation are tightly coupled to SQL syntax, it was difficult to fit everything into a NoSQL store. This might not be the best solution, but here's how I solved it:

  1. Created an index named model_index, which indexes nodes under two keys: type and model.
  2. Lookups with the type key currently use just one value, model. This was introduced primarily to provide SHOW TABLES-style functionality, which gets me a list of all models present in the graph.
  3. Lookups with the model key use values corresponding to the different model names in my system. This is primarily for DESC <TABLENAME> functionality.
  4. For each table creation (as in CREATE TABLE), a node is created, with the table definition attributes stored as node properties.
  5. The created node is indexed under model_index with type:model and model:<model-name>. This adds the newly created model to the list of 'tables' and also lets one reach the model node directly by an index lookup with the model key.
  6. For each record created per model (type, in your case), an outgoing edge labeled instances is created from the model node to the new record: v[123] :=> [instances] :=> v[245], where v[123] represents the model node and v[245] represents a record of v[123]'s type.
  7. Now, if you want to get all instances of a specified type, you can look up model_index with model:<model-name> to reach the model node and then fetch all adjacent nodes over outgoing edges labeled instances. Filtered lookups can then be achieved by applying filters and other, more complex traversals.
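The steps above can be sketched with a small in-memory graph. This is not the adapter's actual code and does not talk to Neo4j or Gremlin; the ToyGraph class, its method names, and the sample models are all hypothetical, chosen only to show one index lookup followed by a single-level traversal.

```python
from collections import defaultdict


class ToyGraph:
    """In-memory stand-in for the graph plus model_index (illustrative only)."""

    def __init__(self):
        self.props = {}                      # node id -> properties
        self.out_edges = defaultdict(list)   # node id -> [(label, target id)]
        self.model_index = defaultdict(set)  # (key, value) -> node ids
        self._next_id = 0

    def _add_node(self, props):
        self._next_id += 1
        self.props[self._next_id] = dict(props)
        return self._next_id

    def _model_node(self, name):
        # Index lookup with the "model" key (the DESC <TABLENAME> analogue).
        hits = self.model_index.get(("model", name), set())
        if len(hits) != 1:
            raise KeyError(f"unknown model: {name}")
        return next(iter(hits))

    def create_model(self, name, definition):
        # CREATE TABLE analogue: one node, indexed under both keys.
        node = self._add_node({"name": name, **definition})
        self.model_index[("type", "model")].add(node)
        self.model_index[("model", name)].add(node)
        return node

    def list_models(self):
        # SHOW TABLES analogue: a single lookup with type:model.
        hits = self.model_index.get(("type", "model"), set())
        return sorted(self.props[n]["name"] for n in hits)

    def create_record(self, model_name, props):
        model = self._model_node(model_name)
        record = self._add_node(props)
        self.out_edges[model].append(("instances", record))
        return record

    def instances_of(self, model_name):
        # One index lookup, then a single-level traversal.
        model = self._model_node(model_name)
        return [dst for label, dst in self.out_edges[model] if label == "instances"]


graph = ToyGraph()
graph.create_model("user", {"columns": "id,name"})
graph.create_model("post", {"columns": "id,title"})
alice = graph.create_record("user", {"id": "1", "name": "alice"})
bob = graph.create_record("user", {"id": "2", "name": "bob"})
graph.create_record("post", {"id": "1", "title": "hello"})

user_records = graph.instances_of("user")
index_entries = sum(len(ids) for ids in graph.model_index.values())
```

Note that the index only ever holds entries for model nodes; record nodes are reached through edges, never through the index.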

The above solution keeps model_index from clogging up, since it contains only two entries per model (one under type, one under model) rather than one per record, and it achieves an efficient record lookup via one index lookup plus a single-level traversal.

In your case, nodes of different types are not adjacent to each other, but even if they were, you could determine the type of any arbitrary node by simply looking up its adjacent node over the incoming edge labeled instances. Further, I'm considering incorporating SpringDataGraph's pattern of storing a __type__ property on each instance node to avoid this adjacent-node lookup.

I'm currently translating AREL to Gremlin scripts for almost everything. You can find the source code for my AR adapter at https://github.com/yournextleap/activerecord-neo4j-adapter

Hope this helps, Cheers! :)

rhetonik answered Oct 12 '22 21:10



A single index will be smaller than several little indexes, because some data, such as the term dictionary, will be shared. However, since a term dictionary lookup is an O(log n) operation, a lookup in a bigger term dictionary might be a little slower. (If you merge 50 indexes into one, this would require only about 6 more comparisons, since 2^6 >= 50, so you likely won't notice any difference.)
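The arithmetic behind that estimate can be checked quickly. The sketch below assumes a plain binary search and 50 equally sized term dictionaries with no shared terms, which is a simplification of how Lucene actually stores terms; the function name is hypothetical.

```python
import math


def extra_comparisons(num_indexes):
    """Extra binary-search steps after merging equally sized dictionaries.

    Searching n terms costs about log2(n) comparisons, so a merged
    dictionary of num_indexes * n terms costs about log2(num_indexes) more.
    """
    return math.ceil(math.log2(num_indexes))


extra_for_50 = extra_comparisons(50)
print(extra_for_50)  # prints 6
```

The extra cost depends only on the number of merged indexes, not on how large each one is.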

Another advantage of a smaller index is that more of it fits in the OS file cache, which is likely to make queries run faster.

Instead of your options 2 and 3, I would index two different fields, id and type, and search for (id:ID AND type:TYPE), but I don't know whether that is possible with Neo4j.
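As a rough illustration of what such a two-field AND query does, here is a toy inverted index that intersects the posting sets for each clause. It is not the Lucene or Neo4j API; the TinyIndex class and the sample documents are hypothetical.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: (field, term) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        for field, term in fields.items():
            self.postings[(field, term)].add(doc_id)

    def search_all(self, **clauses):
        """AND query: return ids of documents matching every field=term clause."""
        sets = [self.postings.get((f, t), set()) for f, t in clauses.items()]
        return set.intersection(*sets) if sets else set()


index = TinyIndex()
index.add(1, {"id": "1234", "type": "foo"})
index.add(2, {"id": "1234", "type": "bar"})
index.add(3, {"id": "5678", "type": "foo"})

# Equivalent in spirit to the query (id:1234 AND type:foo).
matches = index.search_all(id="1234", type="foo")
```

Because the type is a separate field, the id values stay unprefixed and every document carries the same two field names.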

jpountz answered Oct 12 '22 19:10
