How to store multiple distinct types of documents in Lucene

I have an existing Lucene store with many millions of documents, each one representing metadata for an entity. I have a few Id fields (Id1, Id2 .. Id5), and each document can have zero or many values for each of these fields. The index is only ever queried by one of these Ids at a time. I've indexed these fields independently and it is all working great. I initially chose Lucene because it was by far the fastest way to query such a vast number of small documents, and I am happy with my decision.

However, I must now store another type of document, which represents a different kind of metadata for entities, also has values for (Id1, Id2 .. Id5), and will also be queried by one of those Ids at a time. The existing metadata and this new set of data will be stored and queried independently of each other.

How do I query Lucene by an Id, but for only one type of document? I can think of a few options, but I'd like to know what those in the know recommend from experience in order to keep Lucene manageable and fast.

  1. Use separate Lucene indexes. This would avoid the problem, since the document types are orthogonal. There's also the benefit of being able to read from and write to the indexes separately.
  2. Rename the fields Id1..Idn for the new documents to XId1...XIdn. This way, documents of one type would not share field names with documents of another type. This seems like more of a workaround to avoid the problem than an actual solution.
  3. Add a numeric field "Type" and change the indexes to (Type, Idx). This method seems wasteful, as each index would also have to contain the type.

I am able to break backwards compatibility with my existing setup. It would be great if the solution can be reused if I come to add another document type.

andrewjsaid asked Sep 25 '15 13:09


1 Answer

I would definitely reject the third option because of the low selectivity of the type index. There will be only 2 distinct values in the type field, each one matching millions of documents. Lucene would need to intersect this huge posting list with the short posting list from the IdN index, which can still be very fast, but is indeed wasteful.
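To make the selectivity point concrete, here is a toy model in plain Python. It is not Lucene code and not how Lucene implements conjunctions internally; the document count, the even/odd split between types and the value "abc" are all made up for illustration. It only shows that a (Type AND Id1) query drags a huge, unselective posting list into every lookup.

```python
from bisect import bisect_left

def intersect(short, long):
    """Intersect two sorted posting lists by probing the long list
    for each doc id found in the short one."""
    hits = []
    for doc_id in short:
        i = bisect_left(long, doc_id)
        if i < len(long) and long[i] == doc_id:
            hits.append(doc_id)
    return hits

# Hypothetical data: even doc ids are type 0, odd doc ids are type 1.
NUM_DOCS = 1_000_000
type_postings = {
    0: list(range(0, NUM_DOCS, 2)),
    1: list(range(1, NUM_DOCS, 2)),
}

# The Id1 posting list is tiny but mixes both document types.
id1_postings = {"abc": [10, 11, 500_001]}

# Query "Type = 1 AND Id1 = abc": a 3-entry list against a 500,000-entry list.
hits = intersect(id1_postings["abc"], type_postings[1])
print(hits)
```

Every such query has to consult the half-million-entry type list, even though the Id list alone already narrows the candidates to three documents.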

The first two options are effectively the same at query time, because each type of document gets its own terms and posting lists. The difference is at indexing time. Managing several independent indexes requires a bit more coordination and makes the code a little more complex. Yet it may be a good idea if you plan to use the indexes in different contexts. For example:

  • physical location;
  • backup strategies;
  • availability requirements;
  • time-to-index requirements (the time from when a document changes on the client side until it is visible in the index).

Otherwise, I would go with the first option as the simpler and more manageable one.
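As a rough illustration of why the first two options behave the same at query time, here is another toy sketch in plain Python (not Lucene code). `ToyIndex`, the `PREFIXES` table and the type names "entity" and "extra" are hypothetical names invented for this example. In both layouts a query is a single lookup of one term's posting list, with no type filter involved.

```python
from collections import defaultdict

class ToyIndex:
    """A minimal inverted index: (field, value) -> list of doc ids."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.next_doc_id = 0

    def add(self, fields):
        """Index one document given as {field name: [values]}."""
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for field, values in fields.items():
            for value in values:
                self.postings[(field, value)].append(doc_id)
        return doc_id

    def search(self, field, value):
        """Return the doc ids that hold this value in this field."""
        return list(self.postings.get((field, value), []))

# Hypothetical per-type field prefixes; a new document type is one more entry.
PREFIXES = {"entity": "", "extra": "X"}

def prefixed(doc_type, fields):
    """Rename Id1..Idn to the field names used for the given document type."""
    prefix = PREFIXES[doc_type]
    return {prefix + name: values for name, values in fields.items()}

# Option 1: one index per document type, same field names in each.
entity_index, extra_index = ToyIndex(), ToyIndex()
entity_index.add({"Id1": ["abc"], "Id2": ["k1", "k2"]})
extra_index.add({"Id1": ["abc"]})

# Option 2: one shared index, field names prefixed per document type.
shared = ToyIndex()
shared.add(prefixed("entity", {"Id1": ["abc"], "Id2": ["k1", "k2"]}))
shared.add(prefixed("extra", {"Id1": ["abc"]}))

option1_hits = extra_index.search("Id1", "abc")  # doc ids local to extra_index
option2_hits = shared.search("XId1", "abc")      # doc ids in the shared index
```

Either way the new document type is found through one short posting list; what changes is how many indexes you have to open, write to and back up.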

Denis Bazhenov answered Nov 15 '22 09:11