Troubles creating an index with custom analyzer using Jest

Jest provides a brilliant async API for Elasticsearch, and we find it very useful. However, the resulting requests sometimes turn out slightly different from what we would expect.

Usually we didn't care, since everything was working fine, but in this case it was not.

I want to create an index with a custom ngram analyzer. Following the Elasticsearch REST API docs, I run the following:

curl -XPUT 'localhost:9200/test' --data '
{
  "settings": {
    "number_of_shards": 3,
    "analysis": {
      "filter": {
        "keyword_search": {
          "type":     "edge_ngram",
          "min_gram": 3,
          "max_gram": 15
        }
      },
      "analyzer": {
        "keyword": {
          "type":      "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "keyword_search"
          ]
        }
      }
    }
  }
}'

and then I confirm the analyzer is configured properly using:

curl -XGET 'localhost:9200/test/_analyze?analyzer=keyword&text=Expecting many tokens'

In response I receive multiple tokens such as exp, expe, expec and so on.

Now, using the Jest client, I put the config JSON in a file on my classpath; its content is exactly the same as the body of the PUT request above. I execute the Jest action constructed like this:

new CreateIndex.Builder(name)
            .settings(
                    ImmutableSettings.builder()
                            .loadFromClasspath(
                                    "settings.json"
                            ).build().getAsMap()
            ).build();

As a result:

  • Primo - I checked with tcpdump that what is actually posted to Elasticsearch is (pretty-printed):

    {
      "settings.analysis.filter.keyword_search.max_gram": "15",
      "settings.analysis.filter.keyword_search.min_gram": "3",
      "settings.analysis.analyzer.keyword.tokenizer": "whitespace",
      "settings.analysis.filter.keyword_search.type": "edge_ngram",
      "settings.number_of_shards": "3",
      "settings.analysis.analyzer.keyword.filter.0": "lowercase",
      "settings.analysis.analyzer.keyword.filter.1": "keyword_search",
      "settings.analysis.analyzer.keyword.type": "custom"
    }
    
  • Secundo - the resulting index settings are:

    {
      "test": {
        "settings": {
          "index": {
            "settings": {
              "analysis": {
                "filter": {
                  "keyword_search": {
                    "type": "edge_ngram",
                    "min_gram": "3",
                    "max_gram": "15"
                  }
                },
                "analyzer": {
                  "keyword": {
                    "filter": [
                      "lowercase",
                      "keyword_search"
                    ],
                    "type": "custom",
                    "tokenizer": "whitespace"
                  }
                }
              },
              "number_of_shards": "3"   <-- the only difference from the one created with rest call
            },
            "number_of_shards": "3",
            "number_of_replicas": "0",
            "version": {"created": "1030499"},
            "uuid": "Glqf6FMuTWG5EH2jarVRWA"
          }
        }
      }
    }
    
  • Tertio - checking the analyzer with curl -XGET 'localhost:9200/test/_analyze?analyzer=keyword&text=Expecting many tokens' I get just one token!

Question 1. What is the reason that Jest does not post my original settings json, but some processed one instead?

Question 2. Why are the settings generated by Jest not working?

macias asked Feb 12 '23 13:02


1 Answer

Glad you found Jest useful. Please see my answers below.

Question 1. What is the reason that Jest does not post my original settings json, but some processed one instead?

It's not Jest but Elasticsearch's ImmutableSettings that does that. See:

    Map<String, String> test = ImmutableSettings.builder()
            .loadFromSource("{\n" +
                    "  \"settings\": {\n" +
                    "    \"number_of_shards\": 3,\n" +
                    "    \"analysis\": {\n" +
                    "      \"filter\": {\n" +
                    "        \"keyword_search\": {\n" +
                    "          \"type\":     \"edge_ngram\",\n" +
                    "          \"min_gram\": 3,\n" +
                    "          \"max_gram\": 15\n" +
                    "        }\n" +
                    "      },\n" +
                    "      \"analyzer\": {\n" +
                    "        \"keyword\": {\n" +
                    "          \"type\":      \"custom\",\n" +
                    "          \"tokenizer\": \"whitespace\",\n" +
                    "          \"filter\": [\n" +
                    "            \"lowercase\",\n" +
                    "            \"keyword_search\"\n" +
                    "          ]\n" +
                    "        }\n" +
                    "      }\n" +
                    "    }\n" +
                    "  }\n" +
                    "}").build().getAsMap();
    System.out.println("test = " + test);

outputs:

test = {
    settings.analysis.filter.keyword_search.type=edge_ngram,
    settings.number_of_shards=3,
    settings.analysis.analyzer.keyword.filter.0=lowercase,
    settings.analysis.analyzer.keyword.filter.1=keyword_search,
    settings.analysis.analyzer.keyword.type=custom,
    settings.analysis.analyzer.keyword.tokenizer=whitespace,
    settings.analysis.filter.keyword_search.max_gram=15,
    settings.analysis.filter.keyword_search.min_gram=3
}
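The flattening shown above can be imitated without any Elasticsearch dependency. The following is only a minimal plain-Java sketch of the idea, not Elasticsearch's actual implementation; the class FlattenSketch and its flatten and join helpers are hypothetical names made up for illustration. It shows how nested objects turn into dot-separated keys, how array positions become key segments (filter.0, filter.1), and why numbers come back as strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlattenSketch {

    // Walks a nested map/list structure and writes every leaf into `out`
    // under a dot-separated path; list positions become path segments.
    // Hypothetical helper, not the Elasticsearch implementation.
    static void flatten(String prefix, Object node, Map<String, String> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                flatten(join(prefix, String.valueOf(e.getKey())), e.getValue(), out);
            }
        } else if (node instanceof List) {
            List<?> items = (List<?>) node;
            for (int i = 0; i < items.size(); i++) {
                flatten(join(prefix, String.valueOf(i)), items.get(i), out);
            }
        } else {
            // Leaves are stored as strings, so the number 3 becomes "3".
            out.put(prefix, String.valueOf(node));
        }
    }

    static String join(String prefix, String key) {
        return prefix.isEmpty() ? key : prefix + "." + key;
    }

    public static void main(String[] args) {
        // A trimmed-down version of the settings JSON from the question.
        Map<String, Object> keyword = new LinkedHashMap<>();
        keyword.put("type", "custom");
        keyword.put("filter", Arrays.asList("lowercase", "keyword_search"));

        Map<String, Object> analyzer = new LinkedHashMap<>();
        analyzer.put("keyword", keyword);
        Map<String, Object> analysis = new LinkedHashMap<>();
        analysis.put("analyzer", analyzer);

        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put("number_of_shards", 3);
        settings.put("analysis", analysis);

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("settings", settings);

        Map<String, String> flat = new LinkedHashMap<>();
        flatten("", root, flat);
        for (Map.Entry<String, String> e : flat.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Note that every key keeps the leading settings. segment, because the wrapper object is treated like any other nested object.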

Question 2. Why are the settings generated by Jest not working?

Because your usage of settings JSON/map is not the intended case. I have created this test to reproduce your case (it's a bit long but bear with me):

    @Test
    public void createIndexTemp() throws IOException {
        String index = "so_q_26949195";

        String settingsAsString = "{\n" +
                "  \"settings\": {\n" +
                "    \"number_of_shards\": 3,\n" +
                "    \"analysis\": {\n" +
                "      \"filter\": {\n" +
                "        \"keyword_search\": {\n" +
                "          \"type\":     \"edge_ngram\",\n" +
                "          \"min_gram\": 3,\n" +
                "          \"max_gram\": 15\n" +
                "        }\n" +
                "      },\n" +
                "      \"analyzer\": {\n" +
                "        \"keyword\": {\n" +
                "          \"type\":      \"custom\",\n" +
                "          \"tokenizer\": \"whitespace\",\n" +
                "          \"filter\": [\n" +
                "            \"lowercase\",\n" +
                "            \"keyword_search\"\n" +
                "          ]\n" +
                "        }\n" +
                "      }\n" +
                "    }\n" +
                "  }\n" +
                "}";
        Map<String, String> settingsAsMap = ImmutableSettings.builder()
                .loadFromSource(settingsAsString).build().getAsMap();

        // Case 1: send the settings as the original (nested) JSON string.
        CreateIndex createIndex = new CreateIndex.Builder(index)
                .settings(settingsAsString)
                .build();

        JestResult result = client.execute(createIndex);
        assertTrue(result.getErrorMessage(), result.isSucceeded());

        GetSettings getSettings = new GetSettings.Builder().addIndex(index).build();
        result = client.execute(getSettings);
        assertTrue(result.getErrorMessage(), result.isSucceeded());
        System.out.println("SETTINGS SENT AS STRING settingsResponse = " + result.getJsonString());

        Analyze analyze = new Analyze.Builder()
                .index(index)
                .analyzer("keyword")
                .source("Expecting many tokens")
                .build();
        result = client.execute(analyze);
        assertTrue(result.getErrorMessage(), result.isSucceeded());
        Integer actualTokens = result.getJsonObject().getAsJsonArray("tokens").size();
        assertTrue("Expected multiple tokens but got " + actualTokens, actualTokens > 1);

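        // No index is given here, so "keyword" resolves to the built-in keyword
        // analyzer, which emits the whole input as a single token.
        analyze = new Analyze.Builder()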
                .analyzer("keyword")
                .source("Expecting single token")
                .build();
        result = client.execute(analyze);
        assertTrue(result.getErrorMessage(), result.isSucceeded());
        actualTokens = result.getJsonObject().getAsJsonArray("tokens").size();
        assertTrue("Expected single token but got " + actualTokens, actualTokens == 1);

        admin().indices().delete(new DeleteIndexRequest(index)).actionGet();

        // Case 2: recreate the index, this time sending the flattened map.
        createIndex = new CreateIndex.Builder(index)
                .settings(settingsAsMap)
                .build();

        result = client.execute(createIndex);
        assertTrue(result.getErrorMessage(), result.isSucceeded());

        getSettings = new GetSettings.Builder().addIndex(index).build();
        result = client.execute(getSettings);
        assertTrue(result.getErrorMessage(), result.isSucceeded());
        System.out.println("SETTINGS AS MAP settingsResponse = " + result.getJsonString());

        analyze = new Analyze.Builder()
                .index(index)
                .analyzer("keyword")
                .source("Expecting many tokens")
                .build();
        result = client.execute(analyze);
        assertTrue(result.getErrorMessage(), result.isSucceeded());
        actualTokens = result.getJsonObject().getAsJsonArray("tokens").size();
        // This is the assertion that fails for the flattened map.
        assertTrue("Expected multiple tokens but got " + actualTokens, actualTokens > 1);
    }

When you run it, you'll see that in the case where settingsAsMap is used the resulting settings are wrong: the index settings contain a second, nested settings element holding your JSON instead of having it merged in, so the final analyze assertion fails.

Why is this not the intended usage?

Simply because that's how Elasticsearch behaves in this situation. If the settings data is flattened (as ImmutableSettings does by default), it must not have the top-level element settings. If the data is not flattened, it may have that top-level element, which is why the test case with settingsAsString works.
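To make the effect concrete, here is a small plain-Java sketch of where the flattened keys end up. It rests on an assumption that matches the index settings shown in the question: keys that do not already start with index. are stored under that prefix. The class IndexPrefixSketch and the helper withIndexPrefix are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexPrefixSketch {

    // Assumption (consistent with the index settings in the question):
    // keys that do not already start with "index." are stored under it.
    static Map<String, String> withIndexPrefix(Map<String, String> flat) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : flat.entrySet()) {
            String key = e.getKey();
            result.put(key.startsWith("index.") ? key : "index." + key, e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Two of the flattened keys that were actually posted.
        Map<String, String> flat = new LinkedHashMap<>();
        flat.put("settings.number_of_shards", "3");
        flat.put("settings.analysis.analyzer.keyword.type", "custom");

        Map<String, String> stored = withIndexPrefix(flat);

        // Path under which a custom analyzer named "keyword" would be expected.
        String expected = "index.analysis.analyzer.keyword.type";
        System.out.println("keys stored:    " + stored.keySet());
        System.out.println("analyzer found: " + stored.containsKey(expected));
    }
}
```

Under this assumption the analyzer definition sits at index.settings.analysis... rather than index.analysis..., so the custom analyzer is never registered. That would also explain the single token: the name keyword then most likely falls back to the built-in keyword analyzer.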

tl;dr:

Your settings JSON should not include the top level "settings" element (if you run it through ImmutableSettings).
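If the file has to keep its top-level "settings" wrapper (for example because the same file is used with curl), one possible workaround is to drop the leading settings. segment from the flattened keys before handing the map to CreateIndex.Builder. The following is a minimal plain-Java sketch under that assumption; StripSettingsPrefix and stripPrefix are hypothetical names, not part of Jest or Elasticsearch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StripSettingsPrefix {

    // Returns a copy of `flat` in which keys starting with `prefix` have that
    // prefix removed; all other keys are copied unchanged.
    // Hypothetical helper for illustration only.
    static Map<String, String> stripPrefix(Map<String, String> flat, String prefix) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : flat.entrySet()) {
            String key = e.getKey();
            String stripped = key.startsWith(prefix) ? key.substring(prefix.length()) : key;
            result.put(stripped, e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // A few of the flattened keys from the question.
        Map<String, String> flat = new LinkedHashMap<>();
        flat.put("settings.number_of_shards", "3");
        flat.put("settings.analysis.analyzer.keyword.type", "custom");
        flat.put("settings.analysis.analyzer.keyword.filter.0", "lowercase");

        for (Map.Entry<String, String> e : stripPrefix(flat, "settings.").entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

The stripped map could then be passed to .settings(...) in place of the original getAsMap() result. Removing the wrapper from the JSON file itself, or sending the JSON as a string as in the first half of the test, avoids the extra step.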

Cihan Keser answered Feb 14 '23 20:02