The question wasn't clear enough, I think; here's an updated, straight-to-the-point question:
What are the common architectures used in building a meta search engine, and are there any libraries available to build that type of search engine?
I'm looking at building an "enterprise" type of search engine where the indexed data could be coming from proprietary (like Autonomy or a Google Box) or public search engines (like Google Web or Yahoo Web).
A metasearch engine (or search aggregator) is an online information retrieval tool that uses the data of a web search engine to produce its own results. Metasearch engines take input from a user and immediately query search engines for results.
A search engine architecture comprises three basic layers: content collection and refinement, the search core, and the user and application interfaces.
In simple terms, a metasearch engine takes the query you've entered and gathers results from multiple online search engines, such as Google, Bing, and Yahoo. It aggregates those results so you can choose the best information from what is returned.
In general, a search engine consists of three main components, as shown in Figure 1: a crawler, an offline processing system that accumulates data and produces a searchable index, and an online engine for real-time query handling.
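To make the offline/online split concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular engine works: the crawler is replaced by a hard-coded dict of documents, and the names build_index and search are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Offline step: map each lowercase term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Online step: return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Stand-in for crawled content (assumption: real content would come from a crawler).
docs = {
    1: "meta search engine architecture",
    2: "full text search library",
    3: "enterprise search engine",
}
index = build_index(docs)
print(sorted(search(index, "search engine")))
```

The index is built once, ahead of time, while search only does cheap set intersections at query time, which is the point of separating the two stages.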
If you look at Garlic (pdf), you'll notice that its architecture is generic enough and can be adapted to a meta-search engine.
UPDATE:
The rough architectural sketch is something like this:
+---------------------------+
|                           |
|    Meta-Search Engine     |         +---------------+
|                           |         |               |
|   +-------------------+   |---------| Configuration |
|   | Query Processor   |   |         |               |
|   |                   |   |         +---------------+
|   +-------------------+   |
+-------------+-------------+
              |
   +----------+---------------+
+--+----------+-------------+ |
|             |             | |
|     +-------+-------+     | |
|     |    Wrapper    |     | |
|     |               |     | |
|     +-------+-------+     | |
|             |             | |
|             |             | |
|     +-------+--------+    | |
|     |                |    | |
|     | Search Engine  |    | |
|     |                |    +-+
|     +----------------+    |
+---------------------------+
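The parts depicted are the meta-search engine with its query processor, the configuration it reads, and a wrapper in front of each underlying search engine.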
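One possible reading of that sketch in code, as a rough Python illustration rather than anything taken from Garlic: the list of wrappers passed to the query processor plays the role of the configuration, and each wrapper hides one engine behind a common search method. StaticWrapper and QueryProcessor are hypothetical names, the wrapper returns canned results instead of calling a real engine, and the merge step uses reciprocal rank fusion, which is my choice for the example and not something the diagram prescribes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class StaticWrapper:
    """Hypothetical wrapper standing in for an adapter around one real engine.

    A real wrapper would translate the query into the engine's own API call
    and normalize the response into a ranked list of URLs.
    """

    def __init__(self, name, canned_results):
        self.name = name
        self._results = canned_results

    def search(self, query):
        return list(self._results.get(query, []))


class QueryProcessor:
    """Fans a query out to every configured wrapper and merges the rankings."""

    def __init__(self, wrappers, k=60):
        self.wrappers = wrappers  # acts as the "Configuration" box
        self.k = k  # smoothing constant for reciprocal rank fusion (assumed value)

    def search(self, query):
        # Query all wrappers concurrently; pool.map keeps the wrapper order.
        workers = max(1, len(self.wrappers))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            ranked_lists = list(pool.map(lambda w: w.search(query), self.wrappers))
        # Reciprocal rank fusion: each engine contributes 1 / (k + rank) per URL.
        scores = defaultdict(float)
        for ranked in ranked_lists:
            for rank, url in enumerate(ranked, start=1):
                scores[url] += 1.0 / (self.k + rank)
        # Highest score first; ties broken by URL to keep the order deterministic.
        return sorted(scores, key=lambda u: (-scores[u], u))


engine_a = StaticWrapper("a", {"lucene": ["u1", "u2", "u3"]})
engine_b = StaticWrapper("b", {"lucene": ["u2", "u4"]})
processor = QueryProcessor([engine_a, engine_b])
print(processor.search("lucene"))
```

Rank-based fusion is convenient here because the engines' own relevance scores are usually not comparable with each other, or not exposed at all. Adding another source then only means writing one more wrapper and adding it to the list.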
Have a look at Lucene.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.