Is there more information available about the web crawler technology/engine used by Kentico 10, as described in the documentation page "Configuring Page Crawler Indexes"?
The reason I'm asking is because I'd like to consider it for use in a custom crawler project that can sit outside of Kentico, and still allow for it to have an inherent compatibility with the Kentico platform.
As far as I can tell from the Kentico 10 source code, the crawler used by Kentico SmartSearch is completely proprietary. It doesn't use any third-party library.
It downloads the page content using System.Net.HttpWebRequest. The full content is fed back into the SmartSearch indexer as a string. After that it goes through text extraction and is fed to Lucene for indexing.
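To make the "HTML string in, plain text out" step concrete, here is a minimal sketch in Python of what a text extraction pass might look like. It is not Kentico's implementation (that is C# and proprietary); the names TextExtractor and extract_text are hypothetical, and the rule of skipping script and style content is an assumption about what a sensible extractor would do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> bodies.

    Hypothetical helper for illustration; not part of Kentico.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, ready to hand to an indexer."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

The resulting string is what you would pass on to whatever indexer you use; in Kentico's case that role is played by Lucene.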
It's not going to be easy to have Kentico SmartSearch use an external crawler. We usually stay away from the crawler because it is rather expensive to execute compared to the standard index that pulls data straight from the database.
Kentico supports executing some scheduled tasks in a Windows service but not the search tasks.
Note that Kentico SmartSearch doesn't actually crawl the site by discovering links. It uses the content tree to figure out what content it needs to index. If you want to index other content, for example from a system you integrate with, you need to implement a custom search service as described here.
One thing that would work is to have an external process crawl whatever content you want to index and put the raw HTML content into storage. Then write a custom SmartSearch index that pulls the data from storage for indexing within Kentico. If you're indexing content managed by Kentico, you could take that to the next level by hooking into document events. That should allow you to crawl pages only when they're updated.
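As a rough sketch of the external half of that approach, the following Python script fetches a list of URLs and stores the raw HTML in a SQLite table, skipping pages whose content has not changed since the last run. Everything here is an assumption made for illustration: the pages table, the function names, and the choice of SQLite as "storage". The custom SmartSearch index on the Kentico side would then read from the same storage; that part is not shown.

```python
import hashlib
import sqlite3
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def open_store(path=":memory:"):
    """Open (or create) the hypothetical storage table for raw HTML."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL, sha256 TEXT NOT NULL)"
    )
    return conn


def crawl_to_store(conn, urls, fetch=fetch_html):
    """Fetch each URL and store its HTML; return the URLs that were new or changed."""
    changed = []
    for url in urls:
        html = fetch(url)
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        row = conn.execute(
            "SELECT sha256 FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is not None and row[0] == digest:
            continue  # content unchanged, nothing to re-index
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html, sha256) VALUES (?, ?, ?)",
            (url, html, digest),
        )
        changed.append(url)
    conn.commit()
    return changed
```

The list returned by crawl_to_store is the set of pages worth re-indexing. If the content is managed by Kentico, the URL list could come from document update events instead of a fixed list, so only updated pages are fetched.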