I'm using Zend_Search_Lucene to create an index of articles to allow them to be searched on my website. Whenever a administrator updates/creates/deletes an article in the admin area, the index is rebuilt:
$config = Zend_Registry::get("config");
$cache = $config->lucene->cache;
$path = $cache . "/articles";
try
{
$index = Zend_Search_Lucene::open($path);
}
catch (Zend_Search_Lucene_Exception $e)
{
$index = Zend_Search_Lucene::create($path);
}
$model = new Default_Model_Articles();
$select = $model->select();
$articles = $model->fetchAll($select);
foreach ($articles as $article)
{
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text("title", $article->title));
$index->addDocument($doc);
}
$index->commit();
My question is this. Since I am reindexing the articles and handling deleted articles as well, why would I not just use "create" every time (instead of "open" and update)? Using the above method, I think the articles would be added with addDocument every time (so there would be duplicates). How would I prevent that? Is there a way to check if a Document exists already in the index?
Also, I don't think I fully understand how the indexing works when you "open" and update it. It seems to create new #.cfs (so I have _0.cfs, _1.cfs, _2.cfs) files in the index folder every time, but when I use "create", it overwrites that file with a new #.cfs file with the # incremented (so, for example just _2.cfs). Can you please explain what these segmented files are?
Yes , you can check if a Document is already in the index, have a look in this Manual Page. You can then delete this specific Document from the index via $index->delete($id);, where $id is the return value of the termDocs method. After that you can simply add the new version of the Document.
About the multiple index files that Lucene creates: Every time you modify an existing index, Lucene does not realy change the existing files, but adds partial indexes for every change you make. This is extremely bad for performance, but there is a simple way around this. After every change you make to the index do this: $index->optimize(); - this will append all the partial files to the real index, improving searchtimes dramatically.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With