It seems that in Elasticsearch you would define an index on a collection, whereas in a relational DB you would define your index on a column. If the entire collection is indexed, why does it need to be defined?
The word "index" is unfortunately overloaded: it means slightly (edit: VERY) different things in ES and in relational databases, because the two are optimized for different use cases.
An "index" in a database is a secondary data structure that makes WHERE queries and JOINs fast, and it typically stores values exactly as they appear in the table. You can still have columns that aren't indexed, but then a WHERE on them requires a full table scan, which is slow on large tables.
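To make the difference between an indexed lookup and a full table scan concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and index name are made up for illustration, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

# In-memory database with a small, hypothetical "trips" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, country TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO trips (country, year) VALUES (?, ?)",
    [("Holland", 2015), ("Brasil", 2016), ("Argentina", 2017)],
)

def query_plan(sql):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " ".join(row[-1] for row in rows)

sql = "SELECT id FROM trips WHERE country = 'Holland'"

# Without an index on "country", SQLite has to scan the whole table.
plan_before = query_plan(sql)

# After adding a secondary index, the same WHERE can use it instead.
conn.execute("CREATE INDEX idx_trips_country ON trips (country)")
plan_after = query_plan(sql)

print(plan_before)
print(plan_after)
```

The query still works in both cases; the index only changes how the rows are found.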
An "index" in ES is actually a schematic collection of documents, similar to a database in the relational world. You can have different "types" of documents in ES, quite similar to tables in dbs. ES gives you the flexibility of defining, for each field of a document, whether you want to be able to retrieve it, search by it, or both. More details on these options are in the ES documentation, which also covers the related _source field (the original JSON that was submitted to ES).
ES uses an inverted index to efficiently find matching documents, but most importantly it typically "normalizes" strings into tokens so that accurate free-text search can be performed. For example, sentences might be split into individual words, and words might be normalized to lower case, so that searching for "holland" would match the text "Vacation at Holland 2015".
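The normalization step can be sketched in a few lines of Python. This is only a crude stand-in for a real ES analyzer (which is configurable and does much more); the function names `analyze` and `matches` are made up for this example.

```python
import re

def analyze(text):
    """Split text into lowercase word tokens (a rough imitation of an analyzer)."""
    return re.findall(r"\w+", text.lower())

def matches(query, text):
    """True if every token of the query appears among the tokens of the text."""
    doc_tokens = set(analyze(text))
    return all(token in doc_tokens for token in analyze(query))

tokens = analyze("Vacation at Holland 2015")
print(tokens)

# The query and the document go through the same normalization,
# so the difference in letter case no longer matters.
print(matches("holland", "Vacation at Holland 2015"))

# Matching is per token, so a fragment of a word does not match.
print(matches("holl", "Vacation at Holland 2015"))
```

Because matching happens on whole tokens, partial-word search needs a different analysis strategy than the one sketched here.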
If a field does not have an inverted index, you cannot search on it at all (unlike a db, which can fall back to a full table scan). Interestingly, you can also define fields that can be used for searching but cannot be retrieved back; this is mainly beneficial when minimizing disk and RAM usage is important.
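As a rough illustration, a mapping along these lines expresses both cases. It is written as a Python dict that would be sent as the JSON body when creating an index. The field names `body` and `internal_note` are hypothetical, and the parameter names follow the mapping syntax of recent Elasticsearch versions as I understand it; older versions spell these options differently, so check the documentation for your version.

```python
import json

# Hypothetical mapping: field names are invented for this example.
mapping = {
    "mappings": {
        # Leave "body" out of the stored original JSON, so it cannot be retrieved.
        "_source": {"excludes": ["body"]},
        "properties": {
            # Indexed (the default), so it is searchable, but not retrievable
            # because it is excluded from _source above.
            "body": {"type": "text"},
            # Kept in _source, so it is retrievable, but not searchable
            # because no inverted index is built for it.
            "internal_note": {"type": "keyword", "index": False},
        },
    }
}

print(json.dumps(mapping, indent=2))
```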
Elasticsearch is by design a search engine, and is usually not the preferred choice for primary storage the way SQL Server or MongoDB are.
Why is the entire collection indexed?
Elasticsearch internally uses a structure called an inverted index, which stores the values of each field (column) for searching. If the field contains a string, Elasticsearch tokenizes it and applies filters such as converting to lower case.
In any case, you can only find data that is available in the inverted index. So by default Elasticsearch indexes all fields, to make them available and searchable to you.
https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
This is not like adding an index in a relational DB. In a relational DB you already have all the data available, and you only need to index the most-used columns for quicker lookups. But it is very inefficient at finding all the rows that contain a given word or part of a word (that is, at searching for a word).
I'll refer to:
"It seems that in elastic search you would define an index on a collection"
In Elasticsearch, an index is like a database in the relational world. The index contains multiple documents, just like a relational database contains tables.
So far, this is clear.
In order to manage large amounts of data, Elasticsearch (a distributed database by nature) breaks each index into smaller chunks called shards, which are distributed across the Elasticsearch nodes.
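The idea of spreading documents over shards can be sketched as "hash a routing value, then take it modulo the number of primary shards". The code below is only an illustration of that idea: Elasticsearch uses its own hash function on a routing value (the document ID by default), and MD5 is used here purely as a stand-in. The function name `pick_shard` and the shard count of 3 are assumptions made for the example.

```python
import hashlib

def pick_shard(doc_id, num_primary_shards):
    """Map a document ID to a shard number in a stable, repeatable way."""
    # MD5 is a stand-in here, not the hash Elasticsearch actually uses.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

doc_ids = [f"doc_id_{i}" for i in range(1, 9)]
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}

for doc_id, shard in placement.items():
    print(doc_id, "-> shard", shard)
```

The same ID always lands on the same shard, which is what lets a node find a document later without asking every shard.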
The confusion starts with the fact that shards are data structures based on the Apache Lucene library.
Apache Lucene's index belongs to the family of indexes known as inverted indexes.
It is called an "inverted index" because it lists, for each term, the documents that contain it:
Term        Document              Frequency
Brasil      doc_id_1, doc_id_8    4 (2 in doc_id_1, 2 in doc_id_8)
Argentina   doc_id_1, doc_id_6    3 (2 in doc_id_1, 1 in doc_id_6)
So, as you can see above, this structure stores statistics (frequencies) about terms in order to make term-based search more efficient.
(*) This is an inverse (Term -> Documents) of the natural relationship, in which documents list terms (Document -> Terms).
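A small sketch of how such a term-to-documents structure with frequencies could be built is shown below. The document texts are invented so that the counts match the table above, terms are lowercased during tokenization, and `build_inverted_index` is a hypothetical helper, not Lucene's actual implementation.

```python
import re
from collections import defaultdict

# Invented document texts, chosen so the counts match the table above.
documents = {
    "doc_id_1": "Brasil beat Argentina, then Argentina beat Brasil",
    "doc_id_6": "Argentina travel notes",
    "doc_id_8": "Brasil, Brasil!",
}

def build_inverted_index(docs):
    """Return {term: {doc_id: occurrences of the term in that document}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

index = build_inverted_index(documents)

for term in ("brasil", "argentina"):
    postings = index[term]
    print(term, sorted(postings), sum(postings.values()), postings)
```

Looking up a term is now a single dictionary access, instead of reading every document to see whether it contains the term.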
Summary:
1) Elasticsearch index:
There are 2 different usages of the word "index".
One is quite trivial: an index is like a database.
The other is confusing: shards are based on a data structure named "inverted index".
2) Relational database index:
A structure associated with a table or view that speeds up retrieval of rows from that table or view.