How to correctly denormalize one-to-many indexes coming from multiple sources

Question

How can I restructure below elastic indexes to be able to search for registrations that had certain mailing events?

In our application we have the Profile entity which can have one to multiple Registration entities.
The registrations index is used in the majority of searches and contains the data we want to return.
Then we have multiple *Events indexes that contain events that relate to profiles.

A simplified version would look like this:

Registrations
  - RegistrationId
  - ProfileId
  - Location
  
MailEvents
  - ProfileId
  - Template
  - Actions

A simplified search might be: all the registrations in a certain location with any mailevent action for templates starting with "Solar".

Joining like in a classical RDB is an anti-pattern in elastic Db.
We are considering de-normalizing by adding all the various events for profiles to the registrations index? This wil result in an explosion of data in the registrations index. Nested objects are also bad for searching, so we should somehow make them into arrays. But how? We have 100's of rows in the events for every related row in registration. The change rates on the event indexes is way higher then the ones on the registration index.

We are considering doing two requests. One for all the *Events indexes, gathering all the profileIds, unique-ing them, then doing one for the registration part with the result of the first one. It feels wrong and introduces complicated edge cases where there are more results then the max returned rows in the first request or max Terms values in the second.

By searching around I see many people struggling with this and looking for a way to do join queries.
It feels like de-normalizing is the way to go, but what would be the recommended approach?
What other approaches am I missing?

Sam · Accepted Answer

One approach to consider is using Elasticsearch's parent-child relationship, which allows for denormalization in a way that makes it efficient for search. With parent-child, you would make the Registrations index the parent and the MailEvents index the child. This would allow you to store all the MailEvents data directly under each Registration document and would allow for efficient search and retrieval.

Additionally, you could consider using the has_child query to find all Registrations documents that have a certain MailEvent criteria. For example, to find all Registrations with a MailEvent action for templates starting with "Solar", you could write a query like this:

GET /registrations/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Location": "some_location"
          }
        },
        {
          "has_child": {
            "type": "mailevents",
            "query": {
              "bool": {
                "must": [
                  {
                    "prefix": {
                      "Template": "Solar"
                    }
                  },
                  {
                    "exists": {
                      "field": "Actions"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

This approach would give you the best of both worlds - you'd have denormalized data that's efficient for search and retrieval, while also avoiding the complexities of multiple requests and potential edge cases.

Another approach is to use Elasticsearch's aggregation feature. In this approach, you would perform a single search query on the Registrations index, filtered by the desired location. Then, you would use the ProfileId field to aggregate the data and retrieve the related MailEvents information. You can achieve this by using a nested aggregation, where you group by ProfileId and retrieve the relevant MailEvents data for each profile.

Here's an example query that performs this aggregation:

GET /registrations/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Location": "some_location"
          }
        }
      ]
    }
  },
  "aggs": {
    "profiles": {
      "terms": {
        "field": "ProfileId"
      },
      "aggs": {
        "mail_events": {
          "nested": {
            "path": "MailEvents"
          },
          "aggs": {
            "filtered_mail_events": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "prefix": {
                        "MailEvents.Template": "Solar"
                      }
                    },
                    {
                      "exists": {
                        "field": "MailEvents.Actions"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "actions": {
                  "terms": {
                    "field": "MailEvents.Actions"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

This query will return the Registrations documents that match the desired location, and also provide aggregated information about the related MailEvents data. You can further manipulate the aggregated data to get the information that you need.

Note that this approach can be more complex than the parent-child relationship approach and may have performance implications if your data is large and complex. However, it may be a good solution if you need to perform complex aggregations on the MailEvents data.

How to correctly denormalize one-to-many indexes coming from multiple sources

Tags:

elasticsearch

Boris Callens

1 Answers

Sam

Recent Activity

Donate For Us

How to correctly denormalize one-to-many indexes coming from multiple sources

Tags:

elasticsearch

Boris Callens

1 Answers

Sam

Related questions

Recent Activity

Donate For Us