Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elasticsearch multiple analyzers for a single field

I store different kinds of documents in a single index with strict predefined mapping. All of them have some field (say, "body"), but I'd want them to be analyzed slightly differently when indexed (for example, to use different token filters for specific documents) and treaten the same way while searched. As far as I know, analyzers can't be specified per document.

What I also considered to use:

  1. Object fields with differently analyzed subfields for document kinds, so each document has only one filled subfield (like, "body.mail", "body.html"). The problem is that I couldn't search on the whole "body" field which would look through all its subfields (to not break the existing application).
  2. New reincarnation of multi-fields (to have "body" field with a generic analyzer and custonly analyzed "mail", "html", etc. inside it). Hovewer, I'm not sure if it's possible to use them directly while indexing and indirectly while searching (e.g., to save object with {"mail":"smth"} to use a specific index analyzer, then search by "query":{"body":"smth"} to use generic search analyzer).
  3. To separate "body" into several fields with different mappings, remove them from _all, and set copy_to to a single body field. I'm not sure, but it will add a substantial index overhead due to copying.
like image 322
Yuuri Avatar asked Jun 19 '15 16:06

Yuuri


People also ask

Can Elasticsearch index have multiple mappings?

To create a mapping, you will need the Put Mapping API that will help you to set a specific mapping definition for a specific type, or you can add multiple mappings when you create an index.

What are analyzed fields in Elasticsearch?

Text field typeedit These fields are analyzed , that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed. The analysis process allows Elasticsearch to search for individual words within each full text field.

What is field mapping in Elasticsearch?

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Each document is a collection of fields, which each have their own data type. When mapping your data, you create a mapping definition, which contains a list of fields that are pertinent to the document.


2 Answers

I think you can use multi-field. With multi-field you can define analyzers (both indexing & searching) for each sub fields, and do the search on corresponding fields base on applications requirements. In general, index analyzer can be difference from field to field, the same for search analyzer.

{
  "your_type" : {   
    "properties":{
        "body" : {
            "type" : "string",
            "index" : "analyzed",
            "index_analyzer" : "index_body_analyzer",
            "search_analyzer" : "search_body_analyzer",
            "fields" : {
                "mail" : {
                    "type" : "string",
                    "index" : "analyzed",
                    "index_analyzer" : "index_bodymail_analyzer",
                    "search_analyzer" : "search_bodymail_analyzer"
                },
                "html": {
                    "type" : "string",              
                    "index" : "analyzed",
                    "index_analyzer" : "index_bodyhtml_analyzer",
                    "search_analyzer" : "search_bodyhtml_analyzer"
                }
            }
        }
    }
}
like image 115
Duc.Duong Avatar answered Sep 28 '22 02:09

Duc.Duong


As I mentioned in the comments, what you want is not possible. Your requirement, in one sentence, is: have the same data analyzed in multiple ways, but searched as a single field because this would break the existing application.

             -- body.html          
             -- body.email
body field ---- body.content     --- all searched as "body"
            ...
             -- body.destination
             -- body.whatever
  • Your first option is multi-fields which has this exact purpose in mind: have the same data analyzed multiple ways. The problem is that you cannot search for "body" and expect ES to search body.html, body.email... Even if this would be possible, you want to be searched with different analyzers. Again, not possible. This option requires you to change the application and search for each field in a multi_match or in a query_string.

  • Your second option - reincarnation of multi-fields - will again not work because you cannot refer to body and ES, in the background, to match mail, content etc.

  • Third option - using copy_to - will not work because copying to another field "X" means indexing the data being copied will be analyzed with X's analyzer, and this breaks your requirement of having the same data analyzed differently.

  • There could be a fourth option - "path": "just_name" from multi_fields - which at a first look it should work. Meaning, you can have 3 multi-fields (email, content, html) which all three have a body sub-field. Having "path": "just_name" allows you to search just for body even if body is a sub-field of multiple other fields. But this is not possible because this type of multi-fields will not accept different analyzers for the same body.

Either way, you need to change something in your requirements, because they will not work they way you want it.


These being said, I'm curious to see what queries are you using in your application. It would be a simple change (yes, you will need to change your app) from querying body field to querying body.* in a multi_match.

And I have another solution for you: create multiple indices, one index for each analyzer of your body. For example, for mail, content and html you define three indices:

PUT /multi_fields1
{
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "string",
          "index_analyzer": "whitespace",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
PUT /multi_fields2
{
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "string",
          "index_analyzer": "standard",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
PUT /multi_fields3
{
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "string",
          "index_analyzer": "keyword",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

You see that all of them have the same type and the same field name - body - but different index_analyzers. Then you define an alias:

POST _aliases
{
  "actions": [
    {"add": {
        "index": "multi_fields1",
        "alias": "multi"}},
    {"add": {
        "index": "multi_fields2",
        "alias": "multi"}},
    {"add": {
        "index": "multi_fields3",
        "alias": "multi"}}
  ]
}

Name your alias the same as your current index. The application doesn't need to change, it will use the same name for index search, but this name will not point to an index, but to an alias which in turn refers to your multiple indices. What needs to change is how you index the documents, because a html documents needs to go in multi_fields1 index for example, an email document needs to be index in multi_fields2 index etc.

Whatever solution you find/choose, your requirements need to change because the way you want it is not possible.

like image 34
Andrei Stefan Avatar answered Sep 28 '22 04:09

Andrei Stefan