I'm pretty sure I know the answer to this question but am looking for confirmation from someone with more Elasticsearch experience than me.
Let's say I've got a database containing Authors and Books. An author can be associated with 0 or more books, and a book can be associated with 1 or more authors. We want users to be able to search on author name to find the author and all his/her books, and we also want them to be able to search on book title to get back its author(s). We know there will be plenty of multi-author books.
Because Elasticsearch only directly supports one level of parent-child relationships, and because children can only have one parent, it seems to me that we need to denormalize the data and use nested objects to establish this relationship. If we modify properties of an author who has published 23 books, we will need to reindex the author record and all 23 of his/her book records.
In my fantasy world, I'd love to have those 23 books each contain an array of author IDs so that I don't have to reindex books when I reindex authors. It seems like this would definitely be possible using Elasticsearch's parent-child support if a book could only have one author, but because of the many-to-many requirement, I have to use nested objects and reindex any related objects whenever anything changes.
Is this correct? It certainly seems like more work (and certainly more updates), but I want to do this the right way, not the "clever" way that introduces complexity and bugs and madness.
Any guidance would be appreciated.
Because Elasticsearch is not a relational database, it has no native joins the way an SQL database does. It favors search efficiency over storage efficiency, so the stored data is typically flattened out, or denormalized, to keep searches fast.
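To make the denormalization concrete, here is a minimal sketch of the nested-object approach the question describes. The sample rows, the field names (`title`, `authors.id`, `authors.name`) and the helper functions are all made up for illustration; only the `nested` field type is Elasticsearch's own. It builds plain dicts, so it runs without a cluster.

```python
import json

# Hypothetical source-of-truth rows, e.g. loaded from the relational database.
authors = {
    1: {"name": "Ann Author"},
    2: {"name": "Bob Writer"},
}
books = {
    10: {"title": "Shared Work", "author_ids": [1, 2]},
    11: {"title": "Solo Work", "author_ids": [1]},
}

# Index mapping that stores each book's authors as nested objects.
# Field names are assumptions, not taken from the question.
book_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "authors": {
                "type": "nested",
                "properties": {
                    "id": {"type": "integer"},
                    "name": {"type": "text"},
                },
            },
        }
    }
}


def build_book_doc(book_id):
    """Flatten one book plus copies of its authors into a single document."""
    book = books[book_id]
    return {
        "title": book["title"],
        "authors": [
            {"id": author_id, "name": authors[author_id]["name"]}
            for author_id in book["author_ids"]
        ],
    }


def books_to_reindex(author_id):
    """IDs of the books that embed this author and must be rebuilt when it changes."""
    return sorted(
        book_id for book_id, book in books.items() if author_id in book["author_ids"]
    )


print(json.dumps(build_book_doc(10), indent=2))
print(books_to_reindex(1))  # [10, 11]
```

The second function is the cost you are asking about: every change to an author fans out to one reindex per book that embeds a copy of that author.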
Every Elasticsearch index is made up of one or more shards. Each shard is internally a Lucene index, which has a hard limit of about 2 billion documents (precisely Java's Integer.MAX_VALUE minus 128), as explained in this link and this link.
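For reference, the arithmetic behind that limit can be checked in a couple of lines. The `min_shards_for` helper is hypothetical, and it only gives the lower bound implied by the hard limit. It is not a shard-sizing recommendation.

```python
import math

JAVA_INT_MAX = 2**31 - 1               # Java's Integer.MAX_VALUE
LUCENE_MAX_DOCS = JAVA_INT_MAX - 128   # per-shard (per Lucene index) document limit


def min_shards_for(doc_count):
    """Smallest number of shards that could hold doc_count documents at all."""
    return max(1, math.ceil(doc_count / LUCENE_MAX_DOCS))


print(LUCENE_MAX_DOCS)                # 2147483519
print(min_shards_for(5_000_000_000))  # 3
```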
By default, you cannot use from and size to page through more than 10,000 hits. This limit is a safeguard set by the index.max_result_window index setting. If you need to page through more than 10,000 hits, use the search_after parameter instead.
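A rough sketch of how search_after paging works at the request-body level: the next request repeats the same query and sort, and passes the sort values of the last hit from the previous page. The field names (`title`, `title.keyword`, `book_id`) and the sample hit are assumptions for illustration. The sort should end in a field that is unique per document so that ties cannot skip or repeat hits.

```python
import json


def first_page(query, sort, size=1000):
    """Request body for the first page: an ordinary sorted search."""
    return {"size": size, "query": query, "sort": sort}


def next_page(previous_body, last_hit):
    """Same query and sort, resuming after the last hit of the previous page."""
    body = dict(previous_body)
    body["search_after"] = last_hit["sort"]
    return body


# Hypothetical fields; book_id acts as the unique tiebreaker.
sort = [{"title.keyword": "asc"}, {"book_id": "asc"}]
body = first_page({"match": {"title": "work"}}, sort)

# Hypothetical last hit of page one, as it would appear in the search response.
last_hit = {"_id": "10", "sort": ["Shared Work", 10]}

print(json.dumps(next_page(body, last_hit), indent=2))
```

Unlike from and size, each page here costs about the same no matter how deep you go, because Elasticsearch does not have to collect and discard all the preceding hits.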
Elasticsearch is built on Apache Lucene, an open-source library written in Java, and uses it internally for indexing and searching.
From your question I can safely assume that ES will not be your primary data store. So the key to deciding how to denormalise your many-to-many relationship is to figure out how, and for what, you will use ES; that is, which queries you expect to build.
Think in terms of "query command" design and denormalize accordingly. Here are a few pointers:
Hope this makes some sense ;)