I am using Lucene.net 3.0.3 and AzureDirectory 2.0.4937.26631 which I installed from NuGet (called Lucene.Net.Store.Azure in NuGet).
The project description at azuredirectory.codeplex.com states "To be more concrete: you can have 1..N worker roles adding documents to an index, and 1..N searcher webroles searching over the catalog in near real time." (emphasis added), implying that it is possible to have multiple worker roles writing to the index in parallel. However, when I try to do this I get many "Lock obtain timed out: [email protected]." exceptions.
My code follows the example given in the AzureDirectory documentation (azuredirectory.codeplex.com/documentation). My code is roughly (simplified for question).
var dbEntities = // Load database entities here
var docFactory = // Create class that builds lucene documents from dbEntities
var account = // get the CloudStorageAccount
var directory = new AzureDirectory(account, "<my container name>");
using (var writer = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_30), createEvenIfExists, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // Add one Lucene document per database entity
    foreach (var entity in dbEntities)
    {
        writer.AddDocument(docFactory.CreateDocument(entity));
    }
}
When run sequentially, this code works fine. However, if I run the same code in parallel on multiple threads/workers, I get many "Lock obtain timed out: [email protected]." exceptions:
[Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: [email protected].]
at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout) in d:\Lucene.Net\FullRepo\trunk\src\core\Store\Lock.cs:line 83
at Lucene.Net.Index.IndexWriter.Init(Directory d, Analyzer a, Boolean create, IndexDeletionPolicy deletionPolicy, Int32 maxFieldLength, IndexingChain indexingChain, IndexCommit commit) in d:\Lucene.Net\FullRepo\trunk\src\core\Index\IndexWriter.cs:line 1228
at Lucene.Net.Index.IndexWriter..ctor(Directory d, Analyzer a, Boolean create, MaxFieldLength mfl) in d:\Lucene.Net\FullRepo\trunk\src\core\Index\IndexWriter.cs:line 1018
I understand that a "write.lock" file is created in blob storage and when the file contains the text "wrote.lock" the lock is held. I see from my searches that users have had problems with the write.lock not getting cleaned up. That doesn't seem to be my problem since I can get the same code to work correctly when run in sequence, and the lock file is cleaned up in that case.
I see in the AzureDirectory documentation (azuredirectory.codeplex.com/documentation) that "The index can only be updated by one process at a time, so it makes sense to push all Add/Update/Delete operations through an indexing role." However, that doesn't make sense to me, since any role you create should have multiple instances, so there would be multiple instances writing to the index in parallel. Also, the project description directly states that "you can have 1..N worker roles adding documents to an index." Note that it says "an" index, not shards of an index.
Question:
So, is the project description simply wrong? Or is there actually some way to have multiple IndexWriters adding to an index in parallel? I can't see anything in the API to allow that. If it is possible, please provide a code snippet of how to use AzureDirectory to "have 1..N worker roles adding documents to an index" in parallel.
The most performant way to do this is...
1) Use the producer/consumer design pattern.
2) For large indexes, have the producer/consumer pattern produce separate indexes. For example, if I have 4 writers, I build 4 indexes and then use the Lucene API to merge them.
3) After that you have a nice index on your hard drive. The final step, to get it into AzureDirectory, is to use the Lucene Directory.Copy method to copy your index from the FSDirectory (hard drive) to the AzureDirectory.
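The three steps above could be sketched roughly as follows. This is only an illustration under assumptions, not code from the AzureDirectory project: the class name ShardedIndexBuilder, the method BuildAndPublish, the workRoot folder layout and the queue capacity are all hypothetical. It assumes Lucene.Net 3.0.3 (IndexWriter.AddIndexesNoOptimize and the static Directory.Copy) and .NET 4 (BlockingCollection, Task). Error handling is left out, for example for a consumer failing while the producer is blocked on a full queue.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Directory = Lucene.Net.Store.Directory;   // avoid clash with System.IO.Directory
using Version = Lucene.Net.Util.Version;

// Hypothetical helper, named for illustration only.
public static class ShardedIndexBuilder
{
    // documents: the producer side; target: e.g. the AzureDirectory from the question;
    // workRoot: an empty local folder (assumption); writerCount: number of shard writers.
    public static void BuildAndPublish(IEnumerable<Document> documents, Directory target,
                                       string workRoot, int writerCount)
    {
        var queue = new BlockingCollection<Document>(boundedCapacity: 1000);
        var shardPaths = Enumerable.Range(0, writerCount)
            .Select(i => Path.Combine(workRoot, "shard" + i))
            .ToArray();

        // Steps 1 and 2: each consumer owns its own local index, so no two
        // IndexWriters ever compete for the same write.lock.
        var consumers = shardPaths.Select(path => Task.Factory.StartNew(() =>
        {
            using (var dir = FSDirectory.Open(new DirectoryInfo(path)))
            using (var writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                                                true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var doc in queue.GetConsumingEnumerable())
                    writer.AddDocument(doc);
            }
        }, TaskCreationOptions.LongRunning)).ToArray();

        try
        {
            foreach (var doc in documents)
                queue.Add(doc);            // producer
        }
        finally
        {
            queue.CompleteAdding();        // lets the consumers finish and close their writers
        }
        Task.WaitAll(consumers);

        // Step 2 (continued): merge the shard indexes into one local index.
        using (var merged = FSDirectory.Open(new DirectoryInfo(Path.Combine(workRoot, "merged"))))
        {
            var shards = shardPaths
                .Select(p => (Directory)FSDirectory.Open(new DirectoryInfo(p)))
                .ToArray();
            try
            {
                using (var writer = new IndexWriter(merged, new StandardAnalyzer(Version.LUCENE_30),
                                                    true, IndexWriter.MaxFieldLength.UNLIMITED))
                {
                    writer.AddIndexesNoOptimize(shards);
                    writer.Optimize();
                }
            }
            finally
            {
                foreach (var shard in shards)
                    shard.Dispose();
            }

            // Step 3: copy the finished index files from local disk to the target directory.
            Directory.Copy(merged, target, false);
        }
    }
}
```

With the names from the question, the call would look something like ShardedIndexBuilder.BuildAndPublish(docs, new AzureDirectory(account, "<my container name>"), workRoot, 4), where docs and workRoot are placeholders. Only one writer at a time touches any given directory, which is why the lock timeout from the question should not come up in this arrangement.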
I have used this for both the IaaS and PaaS offerings in Azure, and it works great. Keep in mind (I have mentioned this before in other posts) that AzureDirectory, in my opinion, is not "Enterprise" or "serious production" ready. Some things, like network retries, uploading large indexes, and compression of large indexes, needed to be addressed before I was able to call it "production ready". If you can, use the IaaS Azure offering; then you don't need AzureDirectory and can use the vanilla FSDirectory to build/surface your indexes.