I have two Jackrabbit instances containing the same content. Rebuilding the Lucene index is slow (30+ hours), and the downtime needed in the cluster is risky. Is it possible to re-index just one Jackrabbit instance and then copy the Lucene index from it to the other?
Naively copying the Lucene index files beneath the workspace directory doesn't work. The issue appears to be that content is indexed by document number, which maps to a UUID, which in turn maps to the JCR path of the indexed node, and these UUIDs are not stable for a given path across Jackrabbit instances. (Both are actually Day CQ publisher instances populated by replication from a CQ author instance.)
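To make the instability concrete: Jackrabbit mints a fresh random UUID as each node's identifier when the node is created, independently on each instance, so two publishers fed the same content by replication end up with different identifiers for the same path. A minimal plain-Java sketch of that behaviour (no Jackrabbit API; the path and `assignId` helper are hypothetical, purely for illustration):

```java
import java.util.UUID;

public class UuidInstability {
    // Hypothetical stand-in for what each publisher does on node creation:
    // mint a fresh random UUID, independent of the path and of other instances.
    static UUID assignId(String path) {
        return UUID.randomUUID();
    }

    public static void main(String[] args) {
        String path = "/content/example/page"; // hypothetical node path
        UUID onPublisherA = assignId(path);
        UUID onPublisherB = assignId(path);
        System.out.println(path + " on A -> " + onPublisherA);
        System.out.println(path + " on B -> " + onPublisherB);
        // A Lucene index keyed on A's UUIDs is therefore meaningless on B.
        System.out.println("identifiers match: " + onPublisherA.equals(onPublisherB));
    }
}
```

This is why copying just the index files fails: the document-number-to-UUID entries baked into the copied index point at identifiers the target instance has never issued.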
I've managed to find a UUID-to-path mapping in the repository under /jcr:system/jcr:versionStorage/, but I can't see an easy way to copy this between repositories along with the Lucene index. I also can't find the UUID-to-document-number mapping anywhere in the files; is that part of the Lucene index too?
Thanks for any help. I'm leaning towards just re-indexing the second instance separately and accepting the downtime, but any ideas for reducing the risk or the elapsed time of re-indexing the cluster would be appreciated!
In the end we're going the re-index-them-both route: we've managed to repurpose a test instance as an extra live instance that we can drop into the farm temporarily whilst we take the other two out in turn to re-index. However I'd still be interested in hearing better ways to do this!
That seems like a scary idea, honestly. I'm not sure there is any way to guarantee that you've got the same underlying data, even with identical content and hardware configuration.
If your performance numbers look like ours, the time to copy the entire repository is less than the time it takes to reindex. Have you considered just reindexing one repository, doing a backup/copy, and then configuring the backup/copy to be your second instance?
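The reindex-one-then-clone approach sketched above might look something like this. This is a hedged outline, not a tested procedure: the repository paths and the `crx.default` workspace name are placeholders for whatever your CQ publisher actually uses, and the whole-repository copy must happen while the instance is stopped.

```shell
# Placeholder for this instance's repository directory
REPO_HOME=/path/to/crx-quickstart/repository

# 1. Stop the instance, then delete the workspace search index so
#    Jackrabbit rebuilds it from scratch on the next start.
rm -rf "$REPO_HOME/workspaces/crx.default/index"

# 2. Start the instance and wait for reindexing to complete (the slow part).

# 3. Stop it again and copy the ENTIRE repository -- persistence, data
#    store, and the freshly built index together -- so the UUIDs and the
#    index that references them stay consistent on the second instance.
rsync -a "$REPO_HOME/" second-host:/path/to/crx-quickstart/repository/
```

Copying the whole repository sidesteps the UUID mismatch entirely, because the clone carries the same identifiers the index was built against.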