
Performance impact of using a 100-character string as the _id in Elasticsearch

I am planning to store events in Elasticsearch. The index can hold around 100 million events at any point in time. To de-duplicate events, I am planning to build an _id of about 100 characters by concatenating the following fields: entity_id, a UUID (37 chars) + event_creation_time (30 chars) + event_type (30 chars).

This store will serve normal reads and writes along with aggregate queries (no updates or deletes). Can you please let me know whether there would be any performance impact, or any other side effects, from using such lengthy string _id values instead of the default IDs?

Thanks, Harish

Harish asked Jan 03 '16 15:01



1 Answer

By default the _id field is neither indexed nor stored as a separate field (its value is derived from the _uid field), so a 100-character _id should not cause a storage-related performance problem.

Since you will be indexing millions of documents, the main performance issue you are likely to face is during bulk indexing. Make sure your _ids follow a sequential pattern. From the docs:

  • If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid version lookups, since the autogenerated ID is unique.
  • If you are using your own ID, try to pick an ID that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, which offer poor compression and slow down Lucene.
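To get a rough feel for the "compress well" point, here is a minimal sketch that compares zero-padded sequential IDs with UUID-4-style random IDs. It uses zlib purely as a stand-in: Lucene's terms dictionary uses its own prefix-sharing encoding, not zlib, so the numbers only illustrate the trend. The batch size `N`, the seed, and the helper `compressed_size` are my own choices for illustration, not anything from the docs.

```python
import random
import uuid
import zlib


def compressed_size(ids):
    """Return the zlib-compressed size in bytes of the sorted, newline-joined IDs."""
    payload = "\n".join(sorted(ids)).encode("ascii")
    return len(zlib.compress(payload))


N = 10_000  # arbitrary batch size, chosen only for this illustration

# Zero-padded sequential IDs: neighbouring IDs share long common prefixes.
sequential_ids = [f"{i:032d}" for i in range(N)]

# UUID-4-style IDs: 128 random bits each (seeded so the run is repeatable).
rng = random.Random(42)
random_ids = [uuid.UUID(int=rng.getrandbits(128), version=4).hex for _ in range(N)]

seq_size = compressed_size(sequential_ids)
rand_size = compressed_size(random_ids)

print(f"sequential: {seq_size} bytes, random: {rand_size} bytes")
```

Both ID sets are 32 characters wide, so the difference in compressed size comes from the pattern of the IDs rather than their length.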

In a blog post on this topic, long-time Lucene committer Michael McCandless compares different ways of generating _ids; in my opinion it is one of the finest articles I have read.
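Applied to the fields in the question, one possible way to get a sequential pattern is to put a fixed-width creation timestamp first, so that IDs sort in time order, and append the entity UUID and event type afterwards. This is only a sketch under assumptions: the function name `make_event_id`, the `|` separator, and the timestamp format are hypothetical choices of mine, and whether this ordering actually helps on your data is something you would need to measure.

```python
from datetime import datetime, timezone
import uuid


def make_event_id(entity_id: uuid.UUID, created_at: datetime, event_type: str) -> str:
    """Build a deterministic _id with the creation time as the leading component.

    created_at should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    event_type is assumed not to contain the "|" separator.
    """
    # Fixed-width UTC timestamp with microseconds, so string order matches time order.
    ts = created_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    return f"{ts}|{entity_id}|{event_type}"


entity = uuid.UUID("12345678-1234-5678-1234-567812345678")
first = make_event_id(entity, datetime(2016, 1, 3, 15, 1, tzinfo=timezone.utc), "created")
second = make_event_id(entity, datetime(2016, 1, 3, 15, 2, tzinfo=timezone.utc), "updated")

print(first)
print(first < second)
```

Because the same three inputs always produce the same string, re-sending an event yields the same _id, which is what the de-duplication in the question relies on.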

Hope this helps!

ChintanShah25 answered Oct 13 '22 10:10
