
not_indexed field is stored in index

I'm trying to optimize my Elasticsearch schema.

I have a field that holds a URL. I do not need to query or filter on it, only to retrieve it.

My understanding is that a field defined as "index":"no" is not indexed, but is still stored in the index. (See slide 5 in http://www.slideshare.net/nitin_stephens/lucene-basics.) This should correspond to Lucene's UnIndexed, right?

This confuses me. Is there a way to store some fields without them taking more storage than their content alone, and without burdening the index for the other fields?

What am I missing?

eran asked Jun 04 '13 07:06


2 Answers

I'm new to posting on Stack Exchange, but I believe I can help a bit!

There are a few considerations here:

Analyzing

As you don't want to do any extra work on this field, you should set "index": "no". The field will then not be run through any tokenizers or filters, and it will not be added to the inverted index.

Furthermore, it will not be searchable when you direct a query at that specific field (no hits):

{
    "query": {
        "term": {
            "url": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
        }
    }
}

*here "url" is the field name.

However, the field will still be searchable through the _all field (might have a hit):

{
    "query": {
        "term": {
            "_all": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
        }
    }
}

_all field

By default, every field is copied into the _all field. Set "include_in_all": "false" to stop that. This might not be an issue for you, as you are unlikely to search the _all field with a URL by mistake.

I was working with a schema where countries were stored as 2-letter codes (e.g. "NO" means Norway), and it was quite possible that someone would search the _all field for "NO", so I made sure to set "include_in_all": "false".

Note: Any query where you don't specify a field explicitly will be executed against the _all field.
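To make the country-code problem concrete, here is a toy model of how the _all field is assembled. It is only an illustration in plain Python, not Elasticsearch code: build_all_field and the sample document are made up for this sketch, and real Elasticsearch also analyzes the combined text.

```python
# Toy model of the _all field (an illustration only, not Elasticsearch code).
def build_all_field(doc, mapping):
    """Join every field value unless its mapping sets include_in_all to False."""
    parts = []
    for name, value in doc.items():
        if mapping.get(name, {}).get("include_in_all", True):
            parts.append(str(value))
    return " ".join(parts)


# Hypothetical document with a 2-letter country code.
doc = {"title": "Fjord cruises", "country": "NO"}

# Default behaviour: every field is copied into _all, so "NO" is searchable there.
default_all = build_all_field(doc, {})

# With include_in_all disabled on "country", the code stays out of _all.
filtered_all = build_all_field(doc, {"country": {"include_in_all": False}})

print(default_all)   # Fjord cruises NO
print(filtered_all)  # Fjord cruises
```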

Storing

By default, Elasticsearch stores your entire document (unanalyzed, exactly as you sent it) and returns it in each hit's _source field. If you turn this off (perhaps because your Elasticsearch data is getting huge), you need to explicitly set "store": "yes" on each field you want to keep. (Note that store takes yes or no, not true or false. That tripped me up!)

Note: if you do this, you will need to explicitly request the fields you want returned, e.g.:

curl -XGET 'http://path/index_name/type_name/id?fields=url,another_field'
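The difference between relying on _source and relying on individually stored fields can be sketched with a toy model. This is plain Python written for illustration, not how Elasticsearch is implemented; the retrievable function and the sample document are assumptions of the sketch.

```python
# Toy model of retrieval with and without _source (not Elasticsearch code).
def retrievable(doc, mapping, source_enabled, fields=None):
    """Return what a GET could hand back for one document."""
    if source_enabled:
        # The whole original document comes back in _source.
        return {"_source": dict(doc)}
    # Without _source, only requested fields that were stored individually come back.
    wanted = fields or []
    stored = {
        name: doc[name]
        for name in wanted
        if name in doc and mapping.get(name, {}).get("store") == "yes"
    }
    return {"fields": stored}


# Hypothetical document and mapping: only "url" is stored individually.
doc = {"url": "http://www.domain.com/page", "title": "Example"}
mapping = {"url": {"store": "yes"}}

with_source = retrievable(doc, mapping, source_enabled=True)
without_source = retrievable(doc, mapping, source_enabled=False, fields=["url", "title"])

print(with_source)
print(without_source)
```

In the second call "title" is requested but not returned, because it was neither kept in _source nor stored on its own.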

Finally...

I would let Elasticsearch store your whole document (the default) and use the following mapping:

{
    "type_name": {
        "properties": {
            "url": {
                "type": "string",
                "index": "no",
                "include_in_all": "false"
            }
        }
    }
}

The mappings for your other fields go alongside "url" under "properties".
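If several fields need the same treatment, the mapping body can be generated rather than typed by hand. The following is a minimal sketch that only builds and prints the JSON; the helper retrieve_only_string and the second field thumbnail_url are hypothetical names introduced for the example, and sending the body to a cluster is left out.

```python
import json


def retrieve_only_string():
    """Mapping for a string field that is returned via _source but never searched."""
    return {"type": "string", "index": "no", "include_in_all": "false"}


mapping = {
    "type_name": {
        "properties": {
            "url": retrieve_only_string(),
            "thumbnail_url": retrieve_only_string(),  # hypothetical second field
        }
    }
}

# Serialize to the JSON body that would be used as the type mapping.
body = json.dumps(mapping, indent=4)
print(body)
```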

Source: elasticsearch documentation

ramseykhalaf answered Sep 21 '22 06:09


There are two ways to put data into the index: indexing and storing. Indexing a piece of data means that it is tokenized, placed in the inverted index, and can be searched. Storing data means that it is not tokenized or analyzed in any way, and is not added to the inverted index. It is kept in an entirely separate area, in its full-text form. It cannot be searched against, but it can be retrieved, in its original form, by its document ID.

The typical Lucene query process is to query against the indexed data and get back the document IDs of matching documents, then to use those document IDs to retrieve the stored data for those documents and display it to the user.

Data which is indexed but not stored is searchable, but cannot be retrieved in its original form.

Data which is stored but not indexed can be retrieved once you have found a hit, but is not searchable.

Data which is indexed and stored can be both searched and retrieved.

Data which is neither cannot be added to the index at all.

This is covered a bit in the Lucene FAQ.
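The four combinations can be tried out in a small toy model. This is a sketch in plain Python and not Lucene's actual implementation: ToyIndex, its whitespace tokenizer, and the sample field values are all assumptions made for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model: an inverted index plus a separate area for stored values."""

    def __init__(self):
        self.inverted = defaultdict(set)  # (field, term) -> set of document IDs
        self.stored = {}                  # document ID -> {field: original value}

    def add(self, doc_id, field, value, index=True, store=True):
        if not index and not store:
            # Neither indexed nor stored: there is nothing to add.
            raise ValueError("a field must be indexed, stored, or both")
        if index:
            # Simplistic analysis: lowercase and split on whitespace.
            for term in value.lower().split():
                self.inverted[(field, term)].add(doc_id)
        if store:
            self.stored.setdefault(doc_id, {})[field] = value

    def search(self, field, term):
        """Return the IDs of documents whose indexed field contains the term."""
        return sorted(self.inverted.get((field, term.lower()), set()))

    def retrieve(self, doc_id):
        """Return the stored fields of a document, in their original form."""
        return self.stored.get(doc_id, {})


idx = ToyIndex()
idx.add(1, "title", "Lucene basics", index=True, store=True)
idx.add(1, "body", "inverted index internals", index=True, store=False)
idx.add(1, "url", "http://www.domain.com/page", index=False, store=True)

body_hits = idx.search("body", "inverted")                  # indexed: found
url_hits = idx.search("url", "http://www.domain.com/page")  # not indexed: no hits
doc_fields = idx.retrieve(1)                                # only stored fields come back

print(body_hits)
print(url_hits)
print(doc_fields)
```

The "url" field here behaves the way the question asks for: it never touches the inverted index, yet it comes back with the document once another field has produced the hit.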

femtoRgon answered Sep 24 '22 06:09