How to store tree data in a Lucene/Solr/Elasticsearch index or a NoSQL db?

Say instead of documents I have small trees that I need to store in a Lucene index. How do I go about doing that?

An example node in the tree:

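import java.util.List;

class Node
{
    String data;          // space-separated words; must be full-text searchable
    String type;          // a single word
    List<Node> children;  // child subtrees, possibly empty
}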

In the above node, the "data" member variable is a space-separated string of words, so it needs to be full-text searchable. The "type" member variable is just a single word.

The search query will be a tree itself and will search both the data and type in each node and also the structure of the tree for a match. Before matching against a child node, the query must first match the parent node data and type. Approximate matching on the data value is acceptable.
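To make the matching rule concrete, here is a minimal in-memory sketch of it, not an index-backed implementation. It assumes that "approximate matching" means a minimum fraction of the query's words must occur in the stored node's data, and that each query child only needs to match some stored child; the names `TreeMatchSketch`, `dataMatches`, `matches`, and `minOverlap` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TreeMatchSketch {

    // Same shape as the Node class above, plus a convenience constructor.
    static class Node {
        String data;
        String type;
        List<Node> children = new ArrayList<>();

        Node(String type, String data, Node... children) {
            this.type = type;
            this.data = data;
            this.children.addAll(Arrays.asList(children));
        }
    }

    // Assumption: approximate match = at least minOverlap (0.0 to 1.0) of the
    // query's words appear in the candidate's data, ignoring case.
    static boolean dataMatches(String queryData, String candidateData, double minOverlap) {
        Set<String> candidateWords =
            new HashSet<>(Arrays.asList(candidateData.toLowerCase().trim().split("\\s+")));
        String[] queryWords = queryData.toLowerCase().trim().split("\\s+");
        int hits = 0;
        for (String word : queryWords) {
            if (candidateWords.contains(word)) {
                hits++;
            }
        }
        return hits >= minOverlap * queryWords.length;
    }

    // The parent's type and data are checked first; children are only
    // examined once the parent has matched.
    static boolean matches(Node query, Node candidate, double minOverlap) {
        if (!query.type.equals(candidate.type)) {
            return false;
        }
        if (!dataMatches(query.data, candidate.data, minOverlap)) {
            return false;
        }
        // Simplification: two query children may match the same candidate child.
        for (Node queryChild : query.children) {
            boolean found = false;
            for (Node candidateChild : candidate.children) {
                if (matches(queryChild, candidateChild, minOverlap)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

Any indexing scheme for these trees would need to reproduce this parent-before-child check, either inside the query itself or as a filtering step over candidate trees returned by the index.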

What's the best way to index this kind of data? If Lucene does not directly support indexing it, can this be done with Solr or Elasticsearch?

I took a quick look at neo4j, but it seems to store an entire graph in the db, not a large collection (say billions or trillions) of small tree structures. Or is my understanding wrong?

Also, is a non-Lucene-based NoSQL solution better suited for this?

Golam Kawsar asked Apr 02 '12 02:04


2 Answers

Another approach is to store a representation of the current node's location in the tree. For example, the 17th leaf of the 3rd 2nd-level node of the 1st 1st-level node of the 14th tree would be represented as 014.001.003.017.

Assuming 'treepath' is the field name of the tree location, you would query on 'treepath:014*' to find every node and leaf in the 14th tree, including its root. Similarly, to find only the nodes below the root of the 14th tree, you would query on 'treepath:014.*'.
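As a rough illustration of how such paths could be generated before indexing, the sketch below walks a tree and assigns each node a path in the 014.001.003.017 style. The class and method names (`TreePathSketch`, `enumerate`, `isBelow`) and the fixed width of three digits are assumptions made for the example; a collection with billions of trees would need a wider tree-id segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TreePathSketch {

    // Same shape as the Node class in the question, plus a convenience constructor.
    static class Node {
        String data;
        String type;
        List<Node> children = new ArrayList<>();

        Node(String type, String data, Node... children) {
            this.type = type;
            this.data = data;
            this.children.addAll(Arrays.asList(children));
        }
    }

    // Returns treepath -> node in pre-order. The root's path is the tree id;
    // each child appends its 1-based position among its siblings.
    static Map<String, Node> enumerate(int treeId, Node root) {
        Map<String, Node> paths = new LinkedHashMap<>();
        assign(String.format("%03d", treeId), root, paths);
        return paths;
    }

    private static void assign(String path, Node node, Map<String, Node> paths) {
        paths.put(path, node);
        for (int i = 0; i < node.children.size(); i++) {
            String childPath = path + "." + String.format("%03d", i + 1);
            assign(childPath, node.children.get(i), paths);
        }
    }

    // In-memory equivalent of a prefix query such as treepath:014.*
    static boolean isBelow(String ancestorPath, String path) {
        return path.startsWith(ancestorPath + ".");
    }
}
```

Each entry of the returned map would become one indexed document carrying the node's data and type plus its treepath value, with treepath indexed as a single untokenized term so that prefix queries behave as described.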

The major problem with this approach is that moving branches around requires re-ordering every branch after the branch that was moved. If your trees are relatively static, that may only be a minor problem in practice.

(I've seen this approach called either a 'path enumeration' or a 'Dewey Decimal' representation.)

Mark Leighton Fisher answered Oct 19 '22 06:10


This requirement and its solution are captured here: Proposal for nested docs

This design was subsequently implemented by both core Lucene and Elasticsearch. BlockJoinQuery is the core Lucene implementation (renamed ToParentBlockJoinQuery in later Lucene releases), and Elasticsearch appears to have an implementation as outlined here: Elastic search nested docs
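Lucene's block join expects each block of documents to be indexed together with the parent document last. For multi-level trees, one way to satisfy that is a post-order traversal, so that every node comes after all of its descendants. The sketch below only computes that ordering as plain field maps and does not call the Lucene API; the class name `BlockOrderSketch`, the method `toBlock`, and the "depth" field are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BlockOrderSketch {

    // Same shape as the Node class in the question, plus a convenience constructor.
    static class Node {
        String data;
        String type;
        List<Node> children = new ArrayList<>();

        Node(String type, String data, Node... children) {
            this.type = type;
            this.data = data;
            this.children.addAll(Arrays.asList(children));
        }
    }

    // Flattens one tree into a block: descendants first, the root last.
    static List<Map<String, String>> toBlock(Node root) {
        List<Map<String, String>> block = new ArrayList<>();
        flatten(root, 0, block);
        return block;
    }

    private static void flatten(Node node, int depth, List<Map<String, String>> block) {
        // Post-order: emit all children (and their subtrees) before the node itself.
        for (Node child : node.children) {
            flatten(child, depth + 1, block);
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("type", node.type);
        doc.put("data", node.data);
        doc.put("depth", Integer.toString(depth)); // hypothetical field marking the level
        block.add(doc);
    }
}
```

In a real index, each map would be turned into a Lucene document and the whole list added in a single call so the block stays contiguous; Elasticsearch's nested type handles this ordering internally.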

MarkH answered Oct 19 '22 05:10