
What's a good alternative for the output field in Elasticsearch 5.1 Completion Suggestions?

The first error I encountered when indexing my data in ES 5.1 was caused by my Completion Suggester mapping, which contained an output field.

message [MapperParsingException[failed to parse]; nested: IllegalArgumentException[unknown field name [output], must be one of [input, weight, contexts]];]

So I removed it, but now many of my autocompletions are incorrect because the suggester returns the input it matched instead of the single output string.

After some googling I found this article from ES which mentioned the following:

As suggestions are document-oriented, suggestion metadata (e.g. output) should now be specified as a field in the document. The support for specifying output when indexing suggestion entries has been removed. Now suggestion result entry’s text is always the un-analyzed value of the suggestion’s input (same as not specifying output while indexing suggestions in pre-5.0 indices).

I've found that the original value is within the _source field that is returned with the suggestion, but that's not really a solution for me because the key and structure change depending on the original object it comes from.

I can add an extra 'output' field on the original object too, but this isn't a solution for me either because in some cases I have a structure like this:

{
    "id": "c2358e0c-7399-4665-ac2c-0bdd44597ac0",
    "synonyms": ["All available colours", "Colors"],
    "autoComplete": [{
        "input": ["colours available all", "available colours all", "available all colours", "colours all available", "all available colours", "all colours available"]
    }, {
        "input": ["colors"]
    }]
}

In ES 2.4 the structure was like this:

{
    "id": "c2358e0c-7399-4665-ac2c-0bdd44597ac0",
    "synonyms": ["All available colours", "Colors"],
    "SmartSynonym": [{
        "input": ["colours available all", "available colours all", "available all colours", "colours all available", "all available colours", "all colours available"],
        "output": ["All available colours"]
    }, {
        "input": ["colors"],
        "output": ["Colors"]
    }]
}

This wasn't any problem when the 'output' field was present in every Autocomplete object.

How can I easily return the original value in ES 5.1 (e.g. "All available colours") when the user asks for "colours available all", without doing too many manual lookups?

Related Question from other user: Output field in autocomplete suggestion

Glenn Van Schil asked Jan 02 '17 09:01



1 Answer

Updated Answer


We ended up removing the custom plugin from the original answer because it was hard to get it working in Elastic Cloud. Instead we just created a separate document for the autocompletions and removed them from all our other documents.

The object

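// Assumes Jackson 2.x for the @JsonCreator annotation
import com.fasterxml.jackson.annotation.JsonCreator;

import java.util.Map;

public class Suggest {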
    /*
     * Contains the actual value it needs to return
     * iphone 8 plus, plus iphone 8, 8 plus iphone, ...
     * will all result into iphone 8 plus for example
     */
    private String autocompleteOutput;
    /*
     * Contains the field and all the values of that field to autocomplete
     */
    private Map<String, AutoComplete> autoComplete;

    @JsonCreator
    Suggest() {
    }

    public Suggest(String autocompleteOutput, Map<String, AutoComplete> autoComplete) {
        this.autocompleteOutput = autocompleteOutput;
        this.autoComplete = autoComplete;
    }

    public String getAutocompleteOutput() {
        return autocompleteOutput;
    }

    public void setAutocompleteOutput(String autocompleteOutput) {
        this.autocompleteOutput = autocompleteOutput;
    }

    public Map<String, AutoComplete> getAutoComplete() {
        return autoComplete;
    }

    public void setAutoComplete(Map<String, AutoComplete> autoComplete) {
        this.autoComplete = autoComplete;
    }
}

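// Assumes Jackson 2.x for the @JsonCreator annotation
import com.fasterxml.jackson.annotation.JsonCreator;

public class AutoComplete {
    /*
     * Contains the permutation values from the Lucene filter (see original answer)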
     */
    private String[] input;

    @JsonCreator
    AutoComplete() {
    }

    public AutoComplete(String[] input) {
        this.input = input;
    }

    public String[] getInput() {
        return input;
    }
}

with the following mapping

{
  "suggest": {
    "dynamic_templates": [
      {
        "autocomplete": {
          "path_match": "autoComplete.*",
          "match_mapping_type": "*",
          "mapping": {
            "type": "completion",
            "analyzer": "lowercase_keyword_analyzer"
          }
        }
      }
    ],
    "properties": {}
  }
}

This allows us to read the value to display from the autocompleteOutput field of the _source that is returned with each suggestion.
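To illustrate reading that field back, here is a minimal sketch. It assumes the suggest response has already been parsed into plain Java maps and lists (for example with a JSON library), and that each option carries the indexed document under "_source", as the question describes. SuggestOutputReader and its outputs method are hypothetical names, not part of any Elasticsearch client API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: works on an already-parsed response, no Elasticsearch dependency.
class SuggestOutputReader {

    /**
     * Collects the distinct autocompleteOutput values from the entries of one
     * named suggestion. Options without a "_source" map are skipped.
     */
    @SuppressWarnings("unchecked")
    static List<String> outputs(List<Map<String, Object>> suggestionEntries) {
        // LinkedHashSet keeps the order of first appearance and drops duplicates
        Set<String> seen = new LinkedHashSet<>();
        for (Map<String, Object> entry : suggestionEntries) {
            Object options = entry.get("options");
            if (!(options instanceof List)) {
                continue;
            }
            for (Map<String, Object> option : (List<Map<String, Object>>) options) {
                Object source = option.get("_source");
                if (source instanceof Map) {
                    Object output = ((Map<String, Object>) source).get("autocompleteOutput");
                    if (output != null) {
                        seen.add(output.toString());
                    }
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Because every suggestion document now has the same shape, the lookup key is always autocompleteOutput, regardless of which original object the suggestion was derived from.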

Original Answer


After some research I ended up creating a new Elasticsearch 5.1.1 plugin

Create a lucene filter

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

import java.io.IOException;
import java.util.*;

/**
 * Created by glenn on 13.01.17.
 */
public class PermutationTokenFilter extends TokenFilter {
    private final CharTermAttribute charTermAtt;
    private final PositionIncrementAttribute posIncrAtt;
    private final OffsetAttribute offsetAtt;
    private Iterator<String> permutations;
    private int origOffset;

    /**
     * Construct a token stream filtering the given input.
     *
     * @param input
     */
    protected PermutationTokenFilter(TokenStream input) {
        super(input);
        this.charTermAtt = addAttribute(CharTermAttribute.class);
        this.posIncrAtt = addAttribute(PositionIncrementAttribute.class);
        this.offsetAtt = addAttribute(OffsetAttribute.class);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        while (true) {
            //see if permutations have been created already
            if (permutations == null) {
                //see if more tokens are available
                if (!input.incrementToken()) {
                    return false;
                } else {
                    //Get value
                    String value = String.valueOf(charTermAtt);
                    //permute over buffer value and create iterator
                    permutations = permutation(value).iterator();
                    origOffset = posIncrAtt.getPositionIncrement();
                }
            }
            //see if there are remaining permutations
            if (permutations.hasNext()) {
                //Reset the attribute to starting point
                clearAttributes();
                //use the next permutation
                String permutation = permutations.next();
                //add the permutation to the attributes and remove old attributes
                charTermAtt.setEmpty().append(permutation);
                posIncrAtt.setPositionIncrement(origOffset);
                offsetAtt.setOffset(0,permutation.length());
                //remove permutation from iterator
                permutations.remove();
                origOffset = 0;
                return true;
            }
            permutations = null;
        }
    }

    /**
     * Changes the order of a multi value keyword so the completion suggester still knows the original value without
     * tokenizing it if the users asks the words in a different order.
     *
     * @param value unpermuted value ex: Yellow Crazy Banana
     * @return Permuted values ex:
     * Yellow Crazy Banana,
     * Yellow Banana Crazy,
     * Crazy Yellow Banana,
     * Crazy Banana Yellow,
     * Banana Crazy Yellow,
     * Banana Yellow Crazy
     */
    private Set<String> permutation(String value) {
        value = value.trim().replaceAll(" +", " ");
        // Use sets to eliminate semantic duplicates (a a b is still a a b even if you switch the two 'a's in case one word occurs multiple times in a single value)
        // Switch to HashSet for better performance
        Set<String> set = new HashSet<String>();
        String[] words = value.split(" ");
        // Termination condition: only 1 permutation for a array of 1 word
        if (words.length == 1) {
            set.add(value);
        } else if (words.length <= 6) {
            // Give each word a chance to be the first in the permuted array
            for (int i = 0; i < words.length; i++) {
                // Remove the word at index i from the array
                String pre = "";
                for (int j = 0; j < i; j++) {
                    pre += words[j] + " ";
                }

                String post = " ";
                for (int j = i + 1; j < words.length; j++) {
                    post += words[j] + " ";
                }
                String remaining = (pre + post).trim();

                // Recurse to find all the permutations of the remaining words
                for (String permutation : permutation(remaining)) {
                    // Concatenate the first word with the permutations of the remaining words
                    set.add(words[i] + " " + permutation);
                }
            }
        } else {
            Collections.addAll(set, words);
            set.add(value);
        }
        return set;
    }
}

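This filter takes the original input token "All available colours" and emits every possible ordering of its words (see the original question). Values with more than six words are not fully permuted: for those the filter only emits the individual words plus the original value.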
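For readers who want to try the permutation step without a Lucene dependency, here is a standalone sketch of the same idea. It is a simplified illustration, not the filter itself: WordPermutations is a hypothetical class, it uses backtracking instead of the string concatenation in the filter, and it does not apply the six-word cap. The factorial helper shows why such a cap is useful: six words already give 720 orderings, and seven would give 5040.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Standalone illustration of word-order permutation; hypothetical class name.
class WordPermutations {

    /** Returns every distinct ordering of the whitespace-separated words in value. */
    static Set<String> of(String value) {
        String[] words = value.trim().split("\\s+");
        // A set removes duplicates when the same word occurs more than once
        Set<String> result = new LinkedHashSet<>();
        permute(new ArrayList<>(), words, new boolean[words.length], result);
        return result;
    }

    private static void permute(List<String> current, String[] words,
                                boolean[] used, Set<String> out) {
        if (current.size() == words.length) {
            out.add(String.join(" ", current));
            return;
        }
        for (int i = 0; i < words.length; i++) {
            if (used[i]) {
                continue;
            }
            // Choose word i, recurse on the rest, then undo the choice
            used[i] = true;
            current.add(words[i]);
            permute(current, words, used, out);
            current.remove(current.size() - 1);
            used[i] = false;
        }
    }

    /** n!, the upper bound on the number of orderings for n words. */
    static long factorial(int n) {
        long result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
        }
        return result;
    }
}
```

Calling WordPermutations.of("All available colours") yields the six orderings listed in the question (before lowercasing).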

Create the factory

import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;


/**
 * Created by glenn on 16.01.17.
 */
public class PermutationTokenFilterFactory extends AbstractTokenFilterFactory {

    public PermutationTokenFilterFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public PermutationTokenFilter create(TokenStream input) {
        return new PermutationTokenFilter(input);
    }
}

This class is needed to provide the filter to the Elasticsearch plugin.

Create the Elasticsearch plugin

Follow this guide to set up the configuration needed for the Elasticsearch plugin.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>be.smartspoken</groupId>
    <artifactId>permutation-plugin</artifactId>
    <version>5.1.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>Plugin: Permutation</name>
    <description>Permutation plugin for elasticsearch</description>
    <properties>
        <lucene.version>6.3.0</lucene.version>
        <elasticsearch.version>5.1.1</elasticsearch.version>
        <java.version>1.8</java.version>
        <log4j2.version>2.7</log4j2.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>${log4j2.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>${log4j2.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-test-framework</artifactId>
            <version>${lucene.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>${lucene.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>${lucene.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>${elasticsearch.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>false</filtering>
                <excludes>
                    <exclude>*.properties</exclude>
                </excludes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <outputDirectory>${project.build.directory}/releases/</outputDirectory>
                    <descriptors>
                        <descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
                    </descriptors>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>

Make sure you use the correct Elasticsearch, Lucene and Log4j 2 versions in your pom.xml file and provide the correct configuration files.

import be.smartspoken.plugin.permutation.filter.PermutationTokenFilterFactory;
import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

import java.util.HashMap;
import java.util.Map;

/**
 * Created by glenn on 13.01.17.
 */
public class PermutationPlugin extends Plugin implements AnalysisPlugin{

    @Override
    public Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> extra = new HashMap<>();
        extra.put("permutation", PermutationTokenFilterFactory::new);
        return extra;
    }
}

This provides the factory to the plugin and registers it under the filter name "permutation".

After you have installed your new plugin, you need to restart Elasticsearch.

Use the plugin

Add a new custom analyzer that "mocks" the functionality of 2.x

            Settings.builder()
                .put("number_of_shards", 2)
                .loadFromSource(jsonBuilder()
                        .startObject()
                            .startObject("analysis")
                                .startObject("analyzer")
                                    .startObject("permutation_analyzer")
                                        .field("tokenizer", "keyword")
                                        .field("filter", new String[]{"permutation","lowercase"})
                                    .endObject()
                                .endObject()
                            .endObject()
                        .endObject().string())
                .loadFromSource(jsonBuilder()
                        .startObject()
                            .startObject("analysis")
                                .startObject("analyzer")
                                    .startObject("lowercase_keyword_analyzer")
                                        .field("tokenizer", "keyword")
                                        .field("filter", new String[]{"lowercase"})
                                    .endObject()
                                .endObject()
                            .endObject()
                        .endObject().string())
                .build();

Now the only thing you have to do is provide the custom analyzers to your object mapping.

{
    "my_object": {
        "dynamic_templates": [{
            "autocomplete": {
                "path_match": "my.autocomplete.object.path",
                "match_mapping_type": "*",
                "mapping": {
                    "type": "completion",
                    "analyzer": "permutation_analyzer", /* custom analyzer */
                    "search_analyzer": "lowercase_keyword_analyzer" /* custom analyzer */
                }
            }
        }],
        "properties": {
            /*your other properties*/
        }
    }
}

This will also improve performance, because you no longer have to build the permutations yourself before indexing.
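To make the interplay of the two analyzers more concrete, here is a rough simulation in plain Java. It is only a sketch under assumptions: it treats the index side as the lowercased permutations of the value, treats the search side as the whole user input lowercased (keyword tokenizer plus lowercase filter), and approximates the completion suggester with a simple prefix check. PrefixMatchSketch is a hypothetical class; the real suggester does not work on Java lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical simulation of index-time permutations and search-time prefix matching.
class PrefixMatchSketch {

    /** Mimics a keyword tokenizer followed by a lowercase filter. */
    static String lowercaseKeyword(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    /** Mimics the index side: every permutation is lowercased and kept as one token. */
    static List<String> indexTokens(List<String> permutations) {
        List<String> tokens = new ArrayList<>();
        for (String permutation : permutations) {
            tokens.add(lowercaseKeyword(permutation));
        }
        return tokens;
    }

    /** True if the analyzed user input is a prefix of any indexed token. */
    static boolean matches(List<String> indexedTokens, String userInput) {
        String query = lowercaseKeyword(userInput);
        for (String token : indexedTokens) {
            if (token.startsWith(query)) {
                return true;
            }
        }
        return false;
    }
}
```

With the six orderings of "All available colours" as index tokens, an input such as "Colours avail" matches, while "colors" does not and needs its own input entry, as in the question.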

Glenn Van Schil answered Nov 15 '22 04:11
