I've developed an indexing and search application with the Lucene library, but the library has some limitations for custom ranking in my context. Aside from performance, I need scalability and access to all kinds of word frequencies, etc. Is there a powerful open-source full-text search library available?
What is a full-text search database?
Full-text search means searching large bodies of electronically stored text and returning results that contain some or all of the words from the query. In contrast, a traditional search returns only exact matches.
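As a rough illustration of that difference, here is a minimal Python sketch of a toy inverted index. The function names (`tokenize`, `build_index`, `search`) and the sample documents are made up for this example; real engines add stemming, stopwords, ranking and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query, match_all=True):
    """Return ids of documents containing all (or any) of the query words."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings) if match_all else set.union(*postings)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Sphinx is a full-text search server",
    2: "Lucene is a search library written in Java",
    3: "MySQL supports FULLTEXT indexes",
}
index = build_index(docs)

# Exact matching: the whole stored text must equal the query string.
exact = {doc_id for doc_id, text in docs.items() if text == "search library"}

# Full-text matching: documents containing all, or any, of the query words.
all_words = search(index, "search library")
any_word = search(index, "search library", match_all=False)
```

Here the exact comparison finds nothing, while the word-based lookups find the documents that mention the query terms.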
What is Elasticsearch full-text search?
A full-text query performs a linguistic search against documents. The query can contain single or multiple words or phrases, and it returns the documents that match the search condition. Elasticsearch is a search engine based on Apache Lucene, a free and open-source information retrieval software library.
What is full-text search in MySQL?
A full-text index in MySQL is an index of type FULLTEXT. Full-text indexes can be used only with InnoDB or MyISAM tables, and can be created only for CHAR, VARCHAR, or TEXT columns.
What is the advantage of a full-text search?
Users searching full text are more likely to find relevant articles than users searching only abstracts. This finding affirms the value of full-text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly growing digital archives.
http://www.sphinxsearch.com
http://www.sphinxconnector.net/
Key Sphinx features are:
- high indexing and searching performance;
- advanced indexing and querying tools (flexible and feature-rich text tokenizer, querying language, several different ranking modes, etc.);
- advanced result set post-processing (SELECT with expressions, WHERE, ORDER BY, GROUP BY, etc. over text search results);
- proven scalability up to billions of documents, terabytes of data, and thousands of queries per second;
- easy integration with SQL and XML data sources, and SphinxAPI, SphinxQL, or SphinxSE search interfaces;
- easy scaling with distributed searches.
To expand a bit, Sphinx:
- has high indexing speed (up to 10-15 MB/sec per core on an internal benchmark);
- has high search speed (up to 150-250 queries/sec per core against 1,000,000 documents, 1.2 GB of data on an internal benchmark);
- has high scalability (biggest known cluster indexes over 3,000,000,000 documents, and busiest one peaks over 50,000,000 queries/day);
- provides good relevance ranking through a combination of phrase proximity ranking and statistical (BM25) ranking;
- provides distributed searching capabilities;
- provides document excerpts (snippets) generation;
- provides searching from within an application with the SphinxAPI or SphinxQL interfaces, and from within MySQL with the pluggable SphinxSE storage engine;
- supports boolean, phrase, word proximity and other types of queries;
- supports multiple full-text fields per document (up to 32 by default);
- supports multiple additional attributes per document (e.g. groups, timestamps, etc.);
- supports stopwords;
- supports morphological word form dictionaries;
- supports tokenizing exceptions;
- supports both single-byte encodings and UTF-8;
- supports stemming (stemmers for English, Russian and Czech are built in; stemmers for French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish and Hungarian are available by building the third-party libstemmer library);
- supports MySQL natively (all types of tables, including MyISAM, InnoDB, NDB, Archive, etc., are supported);
- supports PostgreSQL natively;
- supports ODBC-compliant databases (MS SQL, Oracle, etc.) natively;
- ...has 50+ other features not listed here; refer to the API and configuration manual!
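Since the question asks about custom ranking and access to word frequencies, and the list above mentions statistical (BM25) ranking, here is a minimal Python sketch of how a BM25 score is built from term and document frequencies. This is the textbook Okapi BM25 formula with a non-negative idf variant, not Sphinx's own ranker code; the function names, the `k1` and `b` defaults, and the sample documents are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every document against the query with textbook Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    term_freqs = [Counter(tokenize(doc)) for doc in docs]  # per-document word counts
    doc_lens = [sum(tf.values()) for tf in term_freqs]     # document lengths in tokens
    avg_len = sum(doc_lens) / len(docs)
    n = len(docs)
    scores = [0.0] * n
    for term in set(tokenize(query)):
        # Document frequency: how many documents contain the term.
        df = sum(1 for tf in term_freqs if term in tf)
        if df == 0:
            continue
        # Non-negative idf variant: rarer terms get a higher weight.
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, tf in enumerate(term_freqs):
            f = tf[term]  # term frequency in document i (0 if absent)
            norm = f + k1 * (1 - b + b * doc_lens[i] / avg_len)
            scores[i] += idf * f * (k1 + 1) / norm
    return scores


# Hypothetical sample documents, for illustration only.
docs = [
    "sphinx full text search server",
    "lucene search library",
    "relational database with tables",
]
scores = bm25_scores(docs, "search server")
```

With these sample documents, the first one matches both query words and scores highest, the second matches only "search", and the third scores zero. A custom ranker would typically combine a statistical score like this with other signals such as phrase proximity or document attributes.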