I am planning to store events in Elasticsearch. It can hold around 100 million events at any point in time. To de-dupe events, I am planning to create an _id column of about 100 chars by concatenating the fields below: entity_id, a UUID (37 chars) + event_creation_time (30 chars) + event_type (30 chars).
This store will have normal reads and writes along with aggregate queries (no updates / deletes). Can you please let me know if there would be any performance impact or other side effects of using such lengthy string _id values instead of the default IDs?
Thanks, Harish
The _id field is not indexed and not stored by default, so there is no performance issue storage-wise.
Since you will be indexing millions of documents, the only major performance issue you will face is during bulk indexing. You have to make sure there is a sequential pattern to your _ids. From the docs:
- If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid version lookups, since the autogenerated ID is unique.
- If you are using your own ID, try to pick an ID that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, which offer poor compression and slow down Lucene.
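For the IDs in the question, one way to get such a sequential pattern is to put a zero-padded timestamp first, so that events created close together share a long common prefix. This is only a minimal sketch: build_event_id and to_bulk_action are hypothetical helper names, the field order and the "|" separator are my assumptions, and created_at is assumed to be a timezone-aware datetime. The dict layout is the action format I would expect to pass to the bulk helper of the Python client, so check it against the client version you use.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def build_event_id(entity_id: str, created_at: datetime, event_type: str) -> str:
    """Build a deterministic _id: the same event always yields the same string.

    Hypothetical helper; assumes created_at is timezone-aware.
    """
    # Integer microseconds since the epoch (no float rounding).
    micros = (created_at - EPOCH) // timedelta(microseconds=1)
    # Zero-padded time goes first so consecutive events sort next to each other.
    return f"{micros:020d}|{event_type}|{entity_id}"


def to_bulk_action(index: str, event: dict) -> dict:
    """Wrap an event as a bulk action dict (assumed layout, see lead-in)."""
    return {
        # "create" rejects a document whose _id already exists,
        # instead of overwriting it like the default "index" action.
        "_op_type": "create",
        "_index": index,
        "_id": build_event_id(
            event["entity_id"], event["event_creation_time"], event["event_type"]
        ),
        "_source": event,
    }


event = {
    "entity_id": "123e4567-e89b-12d3-a456-426614174000",
    "event_creation_time": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "event_type": "order_created",
}
action = to_bulk_action("events", event)
print(action["_id"])
```

Because the ID is derived only from the event's own fields, re-sending the same event produces the same _id, which is what makes the de-dupe work.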
In that blog, longtime Lucene committer Michael McCandless compares different ways of generating _id values, and IMO it is one of the finest articles I have read.
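If you want a quick feel for the difference yourself, here is a rough sketch that compares how well zero-padded sequential IDs and UUID-4 IDs compress. Note the assumption: zlib is only a stand-in here, it is not what Lucene uses for its terms dictionary, so read the numbers as an illustration of "predictable vs. random", not as a benchmark of Elasticsearch.

```python
import uuid
import zlib


def compressed_ratio(ids):
    """Return compressed size / raw size for a list of ID strings."""
    raw = "\n".join(ids).encode("ascii")
    return len(zlib.compress(raw)) / len(raw)


n = 10_000
# Both ID styles are 32 characters long, so the comparison is like for like.
sequential_ids = [f"{i:032d}" for i in range(n)]
random_ids = [uuid.uuid4().hex for _ in range(n)]

print("sequential:", round(compressed_ratio(sequential_ids), 3))
print("uuid4:     ", round(compressed_ratio(random_ids), 3))
```

The sequential IDs should come out with a much smaller ratio, since neighbouring values share almost all of their characters.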
Hope this helps!