I want to use Lucene.NET for full-text search shared between two apps: one is an ASP.NET MVC application and the other is a console application. Both applications are supposed to search and update the index. How should concurrency be handled?
I found a tutorial on ifdefined.com where a similar use case is discussed. My concern is that locking will be a big bottleneck.
PS: I also noticed that IndexSearcher uses a snapshot of the index, and in the tutorial mentioned above the searcher is created only when the index is updated. Is this a good approach? Can I just create a regular searcher object for each search, and if so, what is the overhead?
I found a related question, "Does Lucene.Net manage multiple threads accessing the same index, one indexing while the other is searching?", which claims that interprocess concurrency is safe. Does that mean there are no race conditions on the index?
Also, one very important aspect: what is the performance hit if, let's say, 10-15 threads are trying to update the Lucene index by acquiring the shared lock presented in this solution?
After using it for a couple of months I have to add that opening the index for search often can cause an OutOfMemory exception under high CPU and memory load if the query uses sorting. The cost of the index-opening operation is small (in my experience), but the cost of GC can be quite high.
First of all we have to define a "write" operation. A write operation obtains a lock as soon as it starts and holds it until you close the object that is performing the work. For example, creating an IndexWriter and indexing a document causes the writer to obtain a lock, and it keeps this lock until you close the IndexWriter.
Now we can talk about the lock a little bit. The lock that is obtained is a file-based lock. As mythz mentioned earlier, a file called 'write.lock' is created. Once a write lock is obtained it is exclusive! This lock causes all index-modifying operations (IndexWriter, and some methods of IndexReader) to wait until the lock is released.
Overall, you can have multiple readers on an index. You can even read and write at the same time, no problem. The problem arises with multiple writers: if one thread waits for the lock too long, it will time out.
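To make the locking behaviour above concrete, here is a small language-agnostic sketch in Python. It is not Lucene.NET code: the `WriteLock` class, the `LockTimeout` exception, and the timeout values are all made up for illustration. It only shows the general idea of an exclusive lock file named 'write.lock' that a second writer waits on and eventually gives up on.

```python
import os
import tempfile
import time


class LockTimeout(Exception):
    """Raised when the lock file could not be created within the timeout."""


class WriteLock:
    """Toy exclusive lock backed by a 'write.lock' file (hypothetical helper)."""

    def __init__(self, index_dir, timeout=1.0, poll_interval=0.05):
        # timeout and poll_interval are illustrative values, not Lucene defaults.
        self.path = os.path.join(index_dir, "write.lock")
        self.timeout = timeout
        self.poll_interval = poll_interval

    def acquire(self):
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                # O_EXCL makes creation fail if the file already exists,
                # so only one writer can hold the lock at a time.
                fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise LockTimeout("timed out waiting for " + self.path)
                time.sleep(self.poll_interval)

    def release(self):
        os.remove(self.path)

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()


def demo():
    """First writer holds the lock; a second writer times out, then succeeds."""
    events = []
    with tempfile.TemporaryDirectory() as index_dir:
        first = WriteLock(index_dir)
        second = WriteLock(index_dir, timeout=0.2)
        with first:
            events.append("first acquired")
            try:
                second.acquire()
            except LockTimeout:
                events.append("second timed out")
        # The first writer has released the lock, so the second can take it.
        with second:
            events.append("second acquired")
    return events
```

The point of the sketch is that the second writer's fate depends entirely on how long the first one holds the lock, which is why short write sessions matter so much in the solutions that follow.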
1) Possible Solution #1 Direct Operations
If you are sure that your indexing operations are short and quick, you may be able to simply let both applications write to the same index directly. Otherwise you will have to think about how you want to organize the indexing operations of the applications.
2) Possible Solution #2 Web Service
Since you are working with a web solution, it might be possible to create a web service. When implementing this web service I would dedicate a worker thread to indexing. I would create a work queue to hold the jobs, and if the queue contained multiple jobs, the worker should grab them all and process them as one batch. This avoids the multiple-writer problem, because only one thread ever writes to the index.
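The worker-thread-plus-queue idea can be sketched roughly like this. Again this is plain Python rather than Lucene.NET, and the names `IndexingWorker` and `write_batch` are hypothetical: `write_batch` stands in for whatever code would open the writer, apply a list of jobs, and close the writer.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to shut down


class IndexingWorker:
    """Single dedicated thread that drains a work queue in batches."""

    def __init__(self, write_batch):
        # write_batch is a hypothetical callback that receives a list of jobs
        # and performs all of them in one write session.
        self._write_batch = write_batch
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, job):
        """Called by any request thread; never touches the index itself."""
        self._jobs.put(job)

    def close(self):
        """Process everything submitted so far, then stop the worker."""
        self._jobs.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            batch = [self._jobs.get()]  # block until there is work
            # Grab everything else that is already waiting.
            while True:
                try:
                    batch.append(self._jobs.get_nowait())
                except queue.Empty:
                    break
            stop = any(job is _STOP for job in batch)
            jobs = [job for job in batch if job is not _STOP]
            if jobs:
                self._write_batch(jobs)
            if stop:
                return
```

A caller would do something like `worker = IndexingWorker(my_write_batch)`, call `worker.submit(doc)` from request handlers, and call `worker.close()` on shutdown. How many jobs end up in each batch depends on timing, so the sketch makes no promise about batch sizes, only that jobs are handled in submission order by a single thread.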
3) Create another index, then merge
If the console application does heavy work on the index, you could have it build a separate index and then merge the indexes at some safe scheduled time using IndexWriter.AddIndexes.
From here you can do this in two ways: you can merge directly into the live index, or you can merge into a third index and, when that index is ready, replace the original index with it. You have to be careful here as well to make sure that you're not going to lock something in heavy use and cause a timeout for other write operations.
4) Index & Search multiple indexes
Personally I think people need to separate their indexes out. This helps separate the responsibilities of the programs and minimizes the downtime and maintenance that come with having a single point for all indexes. For example, if your console application is responsible only for adding certain fields, or you are in effect extending an index, you could separate the indexes out but maintain identity by using an ID field in each document. With this you can take advantage of the built-in support for searching multiple indexes using the MultiSearcher class. Or, if you want, there is also a nice ParallelMultiSearcher class that can search both indexes at once.
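As a rough illustration of keeping two indexes tied together by an ID field, here is a toy in-memory sketch in Python. It does not use or imitate the MultiSearcher API: the dictionaries, the `search` and `search_all` functions, and the term-count "score" are all invented for the example, and combining scores per ID is just one possible way an application might join the results.

```python
def search(index, term):
    """Return {doc_id: score} for documents whose text contains the term."""
    term = term.lower()
    hits = {}
    for doc_id, text in index.items():
        count = text.lower().split().count(term)
        if count:
            hits[doc_id] = count
    return hits


def search_all(indexes, term):
    """Search every index and combine scores for documents sharing an ID."""
    combined = {}
    for index in indexes:
        for doc_id, score in search(index, term).items():
            combined[doc_id] = combined.get(doc_id, 0) + score
    # Highest combined score first; ties broken by ID for a stable order.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Two independent "indexes" maintained by different applications.
# Document "42" exists in both and is linked by its ID.
web_index = {"42": "lucene index locking", "43": "asp.net mvc routing"}
console_index = {"42": "locking timeout notes", "44": "batch import locking locking"}

results = search_all([web_index, console_index], "locking")
```

Because each application writes only to its own index, neither one ever waits on the other's write lock; the cost is that the search side has to reconcile documents by ID.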
5) Look into SOLR
Something else that can help with maintaining a single place for your index: you could change your program to work with a SOLR server (http://lucene.apache.org/solr/). There is also a nice SOLRNET library (http://code.google.com/p/solrnet/) that can be helpful in this situation. I'm not experienced with SOLR, but I am under the impression that it will help you manage situations such as this. It also has other benefits, such as hit highlighting, finding related items with "MoreLikeThis", and spell checking.
I'm sure there are other methods, but these are all the ones I can think of. Overall, your solution depends on how many people are writing and how up to date you need the search index to be. If you can defer some operations to a later time and do them in batches, that will give you the best performance in any of these situations. My suggestion is to understand what you're able to work with and go from there. Good luck.