
Is it better to store nested data or use flat structure with unique names in JSON?

In simple words: Is

{
    "diary":{
        "number":100,
        "year":2006
    },
    "case":{
        "number":12345,
        "year":2006
    }
}

or

{
    "diary_number":100,
    "diary_year":2006,
    "case_number":12345,
    "case_year":2006
}

better when using Elasticsearch?

In my case there are only a few keys in total (10-15). Which is better performance-wise?

The use case is displaying data from a NoSQL database (mostly DynamoDB) and also feeding it into Elasticsearch.

secretshardul asked Oct 30 '19 14:10



2 Answers

My rule of thumb: if you need to query or update the nested fields, use a flat structure.

If you map a field with the nested type, Elasticsearch still stores it flat internally, but takes on the overhead of managing the relations between the root document and its nested documents. Performance-wise, flat is generally better, since Elasticsearch doesn't need to relate and look up nested documents.

Here's an excerpt from Managing Relations Inside Elasticsearch which lists some disadvantages you might want to consider.

Elasticsearch is still fundamentally flat, but it manages the nested relation internally to give the appearance of nested hierarchy. When you create a nested document, Elasticsearch actually indexes two separate documents (root object and nested object), then relates the two internally. Both docs are stored in the same Lucene block on the same Shard, so read performance is still very fast.

This arrangement does come with some disadvantages. Most obvious, you can only access these nested documents using a special nested query. Another big disadvantage comes when you need to update the document, either the root or any of the objects.

Since the docs are all stored in the same Lucene block, and Lucene never allows random write access to its segments, updating one field in the nested doc will force a reindex of the entire document.

This includes the root and any other nested objects, even if they were not modified. Internally, ES will mark the old document as deleted, update the field and then reindex everything into a new Lucene block. If your data changes often, nested documents can have a non-negligible overhead associated with reindexing.

Lastly, it is not possible to "cross reference" between nested documents. One nested doc cannot "see" another nested doc's properties. For example, you are not able to filter on "A.name" but facet on "B.age". You can get around this by using include_in_root, which effectively copies the nested docs into the root, but this gets you back to the problems of inner objects.
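To make the "special nested query" point concrete, here is a minimal sketch of what the request bodies could look like if diary were mapped as a nested field. It only builds plain Python dicts in the shape of the Elasticsearch mapping and query DSL; the integer field types and the helper name nested_term_query are assumptions for illustration, not something from the question.

```python
# Request bodies only; no Elasticsearch client is needed to build them.
# Field names follow the question; the integer types are an assumption.

# Mapping that declares "diary" as a nested field instead of a plain object.
nested_mapping = {
    "mappings": {
        "properties": {
            "diary": {
                "type": "nested",
                "properties": {
                    "number": {"type": "integer"},
                    "year": {"type": "integer"},
                },
            }
        }
    }
}


def nested_term_query(path, terms):
    """Hypothetical helper: build a nested query in which every term
    must match inside the same nested object under `path`."""
    must = [{"term": {f"{path}.{field}": value}} for field, value in terms.items()]
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"bool": {"must": must}},
            }
        }
    }


# A plain term query on "diary.number" would not reach the nested docs;
# the query has to be wrapped in "nested" with the matching path.
query = nested_term_query("diary", {"number": 100, "year": 2006})
```

With the flat layout from the question, the same lookup would be an ordinary bool query on diary_number and diary_year, with no nested wrapper.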

Polynomial Proton answered Sep 27 '22 01:09



Nested data is quite good. Unless you explicitly declare diary and case as nested fields, they will be indexed as object fields, so Elasticsearch will itself convert them to

{
    "diary.number":100,
    "diary.year":2006,
    "case.number":12345,
    "case.year":2006
}
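The conversion shown above can be sketched in a few lines of Python. This only imitates the dotted-path flattening conceptually; it is not Elasticsearch's actual implementation, and the function name flatten is made up for this example.

```python
def flatten(doc, prefix=""):
    """Collapse nested dicts into a single level with dotted key paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, carrying the path so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "diary": {"number": 100, "year": 2006},
    "case": {"number": 12345, "year": 2006},
}

flat = flatten(doc)
print(flat)
# {'diary.number': 100, 'diary.year': 2006, 'case.number': 12345, 'case.year': 2006}
```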

Consider also that every field value in Elasticsearch can be an array. You need the nested datatype only if you have many diaries in a single document and need to "maintain the independence of each object in the array".
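A rough illustration of why independence matters once there are several diaries in one document: with object fields, the values of each sub-field are pooled across the array, so a search can match on values that come from different objects. The sketch below simulates that in plain Python. The second diary (200, 2007) and all function names are invented for the example, and the matching logic is a simplification rather than how Elasticsearch evaluates queries.

```python
def flatten_object_array(field, objects):
    """Pool each sub-field's values across all objects in the array,
    roughly how an array of plain objects ends up being indexed."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flat(flat, conditions):
    """True if every condition matches some value of its pooled field."""
    return all(value in flat.get(field, []) for field, value in conditions.items())


def matches_nested(objects, conditions):
    """True only if a single object satisfies all conditions at once."""
    return any(all(obj.get(k) == v for k, v in conditions.items()) for obj in objects)


# Hypothetical document with two diaries.
diaries = [{"number": 100, "year": 2006}, {"number": 200, "year": 2007}]
flat = flatten_object_array("diary", diaries)
# flat == {'diary.number': [100, 200], 'diary.year': [2006, 2007]}

# No diary has number 100 AND year 2007, yet the pooled form matches.
false_positive = matches_flat(flat, {"diary.number": 100, "diary.year": 2007})
nested_result = matches_nested(diaries, {"number": 100, "year": 2007})
print(false_positive, nested_result)  # True False
```

With a single diary and a single case per document, as in the question, this situation cannot arise, which is why plain object fields are enough there.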

zbusia answered Sep 26 '22 01:09
