Apache Lucene vs Google Search Appliance

Has anyone worked with the features of Apache Lucene? I heard it's even comparable to the Google Search Appliance (GSA). Is there a definitive comparison between the two?

The comparisons available online are pretty vague.

Riju Mahna asked May 24 '13 12:05



1 Answer

It's probably hard to find a comparison between Apache Lucene and the Google Search Appliance because they are such different things. Lucene is a software library for indexing documents with basic relevance "boosting" built in, whereas the GSA is an enterprise search product (an appliance, i.e. physical hardware) with lots of out-of-the-box functionality to tune and optimize search results, based on the Google search algorithm.

So they are basically two great tools for different implementation scenarios. Of course they overlap, especially when used to provide search on your average website.

Off the top of my head, here are a few topics you might want to start with for a comparison:

Deployment/Architecture

  • Lucene is a software component that can be deeply integrated into your own software, providing an index (usually file based, sometimes in memory) to store and retrieve content quickly.
  • The Lucene project provides quite a large list of analyzers for proper indexing of different languages (Western languages, Arabic, Asian languages etc.), though the analyzers still have room for improvement.
  • Lucene.NET is quite a popular port for integration on Microsoft .NET platforms.
  • GSA is software and hardware bundled together and sold as an appliance with an HTTP(S) interface that returns the search results either as HTML (through its own XSLTs) or as XML (for better integration into your website).
  • GSA comes with language bundles (installed and downloadable). You have to choose one of the bundles. If you need support for more languages, you might need to add another GSA to the infrastructure (if the required languages are not all in a single bundle).
  • GSA performs excellently and requires very little maintenance.
  • GSA lets you scale with almost no engineering effort: globally distributed but connected GSAs can be set up through the web interface.
  • GSA can be made highly available (HA) by purchasing a cheaper hot-backup module.
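
To make "a software component providing an index" a bit more concrete, here is a minimal sketch of the underlying idea (an inverted index) in plain Python. This is not the Lucene API: the class `TinyIndex`, its methods and the sample documents are made up for illustration, and the `analyze` step is only a crude stand-in for what a real analyzer does.

```python
from collections import defaultdict
import re


class TinyIndex:
    """Toy in-memory inverted index: term -> set of document ids.

    Hypothetical illustration only; it does not mirror Lucene's classes.
    """

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    @staticmethod
    def analyze(text):
        # Crude stand-in for an analyzer: lowercase, split on non-word chars.
        return [t for t in re.split(r"\W+", text.lower()) if t]

    def add(self, doc_id, text):
        # Store the document and register it under each of its terms.
        self.docs[doc_id] = text
        for term in self.analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: a document must contain every query term.
        terms = self.analyze(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = TinyIndex()
index.add(1, "Apache Lucene is a search library")
index.add(2, "The Google Search Appliance is hardware")
index.add(3, "Solr is a search server built on Lucene")
print(index.search("search lucene"))
```

A real library adds scoring, language-specific analysis and on-disk storage on top of this; the point here is only that the index lives inside your own application rather than behind an appliance.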

Indexing

  • Lucene itself does not ship a crawler; it provides the indexing API and leaves content acquisition to you. It doesn't care whether the content comes from a crawler that walks a website the way Google does, from SQL statements run against a database, or from a text stream read out of flat files. You usually have to implement or integrate the crawler yourself (Apache Nutch is a related crawler project).
  • GSA uses the crawler technology used by Google and respects robots instructions (in robots.txt or meta tags). It provides a feed API for sources that cannot be crawled (e.g. because there is no linking between the documents), and it supports setting up SQL queries against all major databases to retrieve data (be it URLs to crawl or the data itself).
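
If you end up writing your own crawler to feed an index, "respecting robots instructions" could look roughly like the sketch below. It uses Python's standard `urllib.robotparser`; the robots.txt content, the user agent name `toy-crawler`, the helper `allowed_urls` and the example URLs are all assumptions for illustration and have nothing to do with how the GSA implements this internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


candidates = [
    "http://example.com/index.html",
    "http://example.com/private/report.pdf",
    "http://example.com/docs/manual.html",
]
crawlable = allowed_urls(ROBOTS_TXT, "toy-crawler", candidates)
print(crawlable)
```

Only the URLs left in `crawlable` would then be fetched and handed to the indexer.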

Retrieval / relevance tuning

  • Lucene does not aim at relevance tuning and has no good support for it (apart from boosting entries in the index). It's up to the application using the index results to do the tuning.
  • Lucene is the index used by Solr, which provides tuning and an architecture more similar to a GSA (including result retrieval over HTTP(S)).
  • GSA lets you bias result sets based on metadata, date and URL patterns. In the latest version you can even set up your own entities and bias the results based on them.
  • GSA supports facets for metadata out of the box, plus some fancier features in its interface such as preview images for documents, autosuggest etc.
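
To give an idea of what "biasing result sets based on date and URL patterns" means when you have to do it yourself in the application layer, here is a rough sketch. The `Hit` class, the scoring formula, the weights and the URLs are all invented for this example; neither Lucene nor the GSA is claimed to work exactly this way.

```python
from dataclasses import dataclass
from datetime import date
import re


@dataclass
class Hit:
    url: str
    base_score: float  # score as returned by the index
    modified: date     # last-modified date of the document


def biased_score(hit, today, url_boosts, recency_weight=0.5):
    """Apply URL-pattern multipliers and a recency bonus to a base score."""
    score = hit.base_score
    for pattern, factor in url_boosts:
        if re.search(pattern, hit.url):
            score *= factor
    age_days = (today - hit.modified).days
    # Bonus decays linearly from recency_weight (today) to 0 after one year.
    score += recency_weight * max(0.0, 1.0 - age_days / 365.0)
    return score


def rerank(hits, today, url_boosts):
    """Sort hits by biased score, best first."""
    return sorted(hits, key=lambda h: biased_score(h, today, url_boosts),
                  reverse=True)


hits = [
    Hit("http://example.com/archive/old-news.html", 1.2, date(2010, 1, 1)),
    Hit("http://example.com/docs/setup.html", 1.0, date(2013, 5, 1)),
]
boosts = [(r"/docs/", 2.0), (r"/archive/", 0.5)]
ranked = rerank(hits, date(2013, 5, 24), boosts)
print([h.url for h in ranked])
```

In this toy data the archive page starts with the higher base score, but the URL penalty and its age push it below the recently updated docs page.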

Commercial things

  • Lucene is an open source product (no license costs), but you have to purchase the hardware to run it
  • GSA starts at around $20k for 500k documents/URLs
  • Google provides several support levels
  • GSA licenses have to be renewed every 2 or 3 years (you get new hardware with the renewal)
  • GSA does not require any additional hardware (appliance is included)

...there's so much more to add, but I hope you get the point.


Update February 2016:

Google has informed partners that the GSA will be discontinued around 2019. The best site to link to at the moment seems to be http://fortune.com/2016/02/04/google-ends-search-appliance/.

Reto Hugi answered Sep 27 '22 20:09