How to have parallel IndexWriters with AzureDirectory and Lucene.net?

I am using Lucene.net 3.0.3 and AzureDirectory 2.0.4937.26631 which I installed from NuGet (called Lucene.Net.Store.Azure in NuGet).

The project description at azuredirectory.codeplex.com states: "To be more concrete: you can have 1..N worker roles adding documents to an index, and 1..N searcher webroles searching over the catalog in near real time" (emphasis added), implying that it is possible to have multiple worker roles writing to the index in parallel. However, when I try to do this I get many "Lock obtain timed out: [email protected]." exceptions.

My code follows the example given in the AzureDirectory documentation (azuredirectory.codeplex.com/documentation). Roughly (simplified for this question), it is:

var dbEntities = // Load database entities here
var docFactory = // Create class that builds Lucene documents from dbEntities
var account = // Get the CloudStorageAccount
var directory = new AzureDirectory(account, "<my container name>");
using (var writer = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_30), createEvenIfExists, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var entity in dbEntities)
    {
        writer.AddDocument(docFactory.CreateDocument(entity));
    }
}

When run sequentially, this code works fine. However, if I run the same code in parallel on multiple threads/workers, I get many "Lock obtain timed out: [email protected]." exceptions:

[Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: [email protected].]
   at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout) in d:\Lucene.Net\FullRepo\trunk\src\core\Store\Lock.cs:line 83
   at Lucene.Net.Index.IndexWriter.Init(Directory d, Analyzer a, Boolean create, IndexDeletionPolicy deletionPolicy, Int32 maxFieldLength, IndexingChain indexingChain, IndexCommit commit) in d:\Lucene.Net\FullRepo\trunk\src\core\Index\IndexWriter.cs:line 1228
   at Lucene.Net.Index.IndexWriter..ctor(Directory d, Analyzer a, Boolean create, MaxFieldLength mfl) in d:\Lucene.Net\FullRepo\trunk\src\core\Index\IndexWriter.cs:line 1018

I understand that a "write.lock" file is created in blob storage, and that the lock is held while the file contains the text "wrote.lock". I have seen from my searches that other users have had problems with the write.lock file not being cleaned up. That doesn't seem to be my problem, since the same code works correctly when run sequentially, and the lock file is cleaned up in that case.

I see in the AzureDirectory documentation (azuredirectory.codeplex.com/documentation) that "The index can only be updated by one process at a time, so it makes sense to push all Add/Update/Delete operations through an indexing role." However, that doesn't make sense, since any role you create would normally run multiple instances, so there would still be multiple instances writing to the index in parallel. Also, the project description directly states that "you can have 1..N worker roles adding documents to an index." Note that it says "an" index, not shards of an index.

Question:

So, is the project description simply wrong? Or is there actually some way to have multiple IndexWriters adding to an index in parallel? I can't see anything in the API to allow that. If it is possible, please provide a code snippet of how to use AzureDirectory to "have 1..N worker roles adding documents to an index" in parallel.

asked Nov 12 '22 by Jeff Walker Code Ranger
1 Answer

The most performant way to do this is...

1) Use the producer/consumer design pattern (see the sketch below):

  • you can have x number of producer threads/tasks reading from the database and queuing up work
  • you can have x number of consumer threads/tasks, each with its own individual writer writing to its own index
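
A minimal sketch of this step, assuming a hypothetical LoadEntities() helper, the docFactory from the question, and a workDir scratch path; all names and the consumer count are illustrative, not part of any library API:

// Sketch only. Requires:
// using System.Collections.Concurrent;  using System.IO;  using System.Linq;
// using System.Threading.Tasks;  using Lucene.Net.Analysis.Standard;
// using Lucene.Net.Documents;  using Lucene.Net.Index;  using Lucene.Net.Store;
// using Version = Lucene.Net.Util.Version;

var queue = new BlockingCollection<Document>(boundedCapacity: 1000);

// Producer: reads from the database and queues up Lucene documents.
var producer = Task.Run(() =>
{
    foreach (var entity in LoadEntities())          // hypothetical DB helper
        queue.Add(docFactory.CreateDocument(entity));
    queue.CompleteAdding();
});

// Consumers: each one owns a private on-disk index, so no write.lock is
// ever contended between writers.
const int consumerCount = 4;
var consumers = Enumerable.Range(0, consumerCount).Select(i => Task.Run(() =>
{
    var dir = FSDirectory.Open(new DirectoryInfo(Path.Combine(workDir, "index-" + i)));
    using (var writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                                        true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        foreach (var doc in queue.GetConsumingEnumerable())
            writer.AddDocument(doc);
    }
})).ToArray();

Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());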

2) For large indexes, the producer/consumer pattern should produce separate indexes. For example, if I have 4 writers I build 4 indexes, then I use the Lucene API to merge them, as in the sketch below.
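
Continuing the sketch above, the partial indexes can be merged with IndexWriter.AddIndexesNoOptimize from the Lucene.Net 3.0.3 API; the directory names carry over from the previous sketch and are illustrative:

var mergedDir = FSDirectory.Open(new DirectoryInfo(Path.Combine(workDir, "merged")));
using (var merger = new IndexWriter(mergedDir, new StandardAnalyzer(Version.LUCENE_30),
                                    true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var parts = Enumerable.Range(0, consumerCount)
        .Select(i => FSDirectory.Open(new DirectoryInfo(Path.Combine(workDir, "index-" + i))))
        .ToArray();

    // Copies the segments of the partial indexes into the merged index
    // without forcing a full optimize.
    merger.AddIndexesNoOptimize(parts);
    merger.Optimize(); // optional: reduce segment count before uploading
}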

3) After that, you have a complete index on your hard drive. The final step is to use the Lucene Directory.Copy method to copy the index from the FSDirectory (hard drive) to the AzureDirectory (see the sketch after this list).

  • this is important because AzureDirectory internally uses metadata properties on Azure Blob Storage to determine the "last update fingerprint" of an index
  • AzureDirectory also compresses the indexes before uploading. This is the reason I like the hard-drive step before sending the index to Azure Blob Storage: I can use parallel threads to compress it on the hard drive. I changed the implementation of AzureDirectory because it does everything in memory, and doing that for a 20 GB index is not good :)
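
A sketch of that final copy step, continuing from the merge above and using the static Directory.Copy helper that exists in Lucene.Net 3.0.3; the account variable and container name mirror the question's setup:

// Push the merged on-disk index up to blob storage in one shot.
var azureDirectory = new AzureDirectory(account, "<my container name>");
Lucene.Net.Store.Directory.Copy(mergedDir, azureDirectory, true); // true = close the source when done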

I have used this with both the IaaS and PaaS offerings in Azure, and it works great. Keep in mind (I have mentioned this before in other posts) that AzureDirectory, in my opinion, is not "Enterprise" or "serious production" ready; things like network retries, uploading large indexes, and compression of large indexes needed to be addressed before I was able to call it "production ready". If you can, use the IaaS Azure offering; then you don't need AzureDirectory at all, and you can use the vanilla FSDirectory to build/surface your indexes.

answered Dec 12 '22 by Bart Czernicki