
Elasticsearch index much larger than the actual size of the logs it indexed?

I noticed that Elasticsearch consumed over 30 GB of disk space overnight. By comparison, the total size of all the logs I wanted to index is only 5 GB... well, not even that really, probably more like 2.5-3 GB. Is there any reason for this, and is there a way to re-configure it? I'm running the ELK stack.

Asked by Christopher Bruce on Jan 15 '15


People also ask

How do I reduce Elasticsearch index size?

Watch your shard size: to increase the size of your shards, you can decrease the number of primary shards in an index by creating indices with fewer primary shards, creating fewer indices (e.g. by leveraging the Rollover API), or modifying an existing index using the Shrink API.

What is the maximum index size in Elasticsearch?

As a best practice, a single Elasticsearch shard should not grow above 50 GB. This limit is not directly enforced by Elasticsearch.

What is the difference between index and indices in Elasticsearch?

An index is a collection of documents, and indices is simply the plural of index. In Elasticsearch, a search can target a single index or multiple indices by name.

How does Elasticsearch scale the volume of data?

Elasticsearch is built to be always available and to scale with your needs. It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity and Elasticsearch automatically distributes your data and query load across all of the available nodes.


2 Answers

There are a number of reasons why the data inside of Elasticsearch would be much larger than the source data. Generally speaking, Logstash and Lucene are both working to add structure to data that is otherwise relatively unstructured. This carries some overhead.

If you're working with a source of 3 GB and your indexed data is 30 GB, that's a multiple of about 10x over your source data. That's big, but not necessarily unheard of. If you're including the size of replicas in that measurement, then 30 GB could be perfectly reasonable. Based on my own experience and intuition, I might expect something in the 3–5x range relative to source data, depending on the kind of data, and the storage and analysis settings you're using in Elasticsearch.

Here are four different settings you can experiment with when trying to slim down an Elasticsearch index.

The _source Field

Elasticsearch keeps a copy of the raw original JSON of each incoming document. It's useful if you ever want to reconstruct the original contents of your index, or for match highlighting in your search results, but it definitely adds up. You may want to create an index template which disables the _source field in your index mappings.

Disabling the _source field may be the single biggest improvement in disk usage.

Documentation: Elasticsearch _source field
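
For concreteness, here's a minimal sketch of such a template, assuming Logstash's default logstash-* index naming and a node on the default port; the template name is made up for illustration:

    # Hypothetical index template: disable the stored _source copy for any
    # index whose name matches logstash-*.
    curl -XPUT 'localhost:9200/_template/logstash_slim' -d '{
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "_source": { "enabled": false }
        }
      }
    }'

Keep in mind that without _source you give up seeing the original document in search hits, the update API, and the ability to reindex straight out of Elasticsearch.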

Individual stored fields

Similar to, but separate from, the _source field, you can control on a per-field basis whether a field's value is stored. Pretty straightforward, and mentioned a few times in the Mapping documentation for core types.

If you want a very small index, then you should only store the bare minimum fields that you need returned in your search responses. That could be as little as just the document ID to correlate with a primary data store.

Documentation: Elasticsearch mappings for core types
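
As a sketch, with invented index, type, and field names and assuming _source has been disabled as above, a mapping that stores only an ID field might look like this:

    # Hypothetical mapping: only doc_id is stored (so it can be returned in
    # search hits); message is indexed for search but not stored.
    curl -XPUT 'localhost:9200/logstash-example/_mapping/logs' -d '{
      "logs": {
        "properties": {
          "doc_id":  { "type": "string", "index": "not_analyzed", "store": true },
          "message": { "type": "string", "store": false }
        }
      }
    }'

Note that "store" defaults to false anyway; it's spelled out here only to make the contrast explicit.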

The _all Field

Sometimes you want to find documents that match a given term, and you don't really care which field that term occurs in. For that case, Elasticsearch has a special _all field, into which it shoves all the terms in all the fields in your documents.

It's convenient, but if your searches are fairly well targeted to specific fields, and you're not trying to loosely match anything/everything anywhere in your index, then you can get away with not using the _all field.

Documentation: Elasticsearch _all field
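
As a sketch, reusing the assumed logstash-* pattern and another invented template name:

    # Hypothetical template: turn off the catch-all _all field, on the
    # assumption that searches target specific fields.
    curl -XPUT 'localhost:9200/_template/logstash_no_all' -d '{
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "_all": { "enabled": false }
        }
      }
    }'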

Analysis in general

This is back to the subject of Lucene adding structure to your otherwise unstructured data. Any fields which you intend to search against will need to be analyzed. This is the process of breaking a blob of unstructured text into tokens, and analyzing each token to normalize it or expand it into many forms. These tokens are inserted into a dictionary, and mappings between the terms and the documents (and fields) they appear in are also maintained.

This all takes space, and for some fields, you may not care to analyze them. Skipping analysis also saves some CPU time when indexing. Some kinds of analysis can really inflate your total terms, like using an n-gram analyzer with liberal settings, which breaks down your original terms into many smaller ones.

Documentation: Introduction to Analysis and Analyzers
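
For example, here's a sketch with invented field names: leave the free-text message field analyzed, but mark identifier-like fields as not_analyzed so each is indexed as a single exact term rather than tokenized.

    # Hypothetical mapping fragment: message stays analyzed for full-text
    # search; host and request_uri are indexed as single, exact terms.
    curl -XPUT 'localhost:9200/logstash-example/_mapping/logs' -d '{
      "logs": {
        "properties": {
          "message":     { "type": "string" },
          "host":        { "type": "string", "index": "not_analyzed" },
          "request_uri": { "type": "string", "index": "not_analyzed" }
        }
      }
    }'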

More reading

  • Peter Kim recently wrote about this subject: Hardware sizing or "How many servers do I really need?"
Answered by Nick Zadrozny


As the previous answer explains in detail, there are many reasons why log data can grow in size after being indexed into Elasticsearch. The blog post it linked to is now dead because I killed my personal blog, but it now lives on the elastic.co website: https://www.elastic.co/blog/elasticsearch-storage-the-true-story.

Answered by petedogg