 

Spring data elastic search wild card search

I am trying to search for the word "blue" in the below list of texts:

"BlueSaphire","Bluo","alue","blue", "BLUE", "Blue","Blue Black","Bluo","Saphire Blue", "black" , "green","bloo" , "Saphireblue"

SearchQuery searchQuery = new NativeSearchQueryBuilder().withIndices("color")
                  .withQuery(matchQuery("colorDescriptionCode", "blue")
                  .fuzziness(Fuzziness.ONE)
                  )
                  .build();

This works fine, and the search returns the records below along with their scores:

alue    2.8718023
Bluo    1.7804208
Bluo    1.7804208
BLUE    1.2270637
blue    1.2270637
Blue    1.2270637
Blue Black    1.1082436
Saphire Blue    0.7669148

But I am not able to make the wildcard work. "SaphireBlue" and "BlueSaphire" are also expected to be part of the result.

I tried the below setting, but it does not work:

SearchQuery searchQuery = new NativeSearchQueryBuilder().withIndices("color")
                      .withQuery(matchQuery("colorDescriptionCode", "(.*?)blue")
                      .fuzziness(Fuzziness.ONE)
                      )
                      .build();

On Stack Overflow, I observed a solution that specifies an analyzed wildcard:

QueryBuilder queryBuilder = boolQuery().should(
                queryString("blue").analyzeWildcard(true)
                        .field("colorDescriptionCode", 2.0f));

I don't find the queryString static method. I am using spring-data-elasticsearch 2.0.0.RELEASE.

Let me know how I can specify the wildcard so that all words containing "blue" are also returned in the search results.

asked Jan 03 '23 by lives


1 Answer

I know that working examples are always better than theory, but still, I would first like to tell a little theory. The heart of Elasticsearch is Lucene. So before a document is written to the Lucene index, it goes through the analysis stage. The analysis stage can be divided into 3 parts:

  1. char filtering;
  2. tokenizing;
  3. token filtering

In the first stage, we can throw away unwanted characters, for example, HTML tags. More information about character filters can be found on the official site. The next stage is far more interesting. Here we split the input text into tokens, which will be used later for searching. A few very useful tokenizers:

  • standard tokenizer. It's used by default. This tokenizer implements the Unicode Text Segmentation algorithm. In practice, you can use it to split text into words and use these words as tokens.
  • n-gram tokenizer. This is what you need if you want to search by part of a word. This tokenizer splits text into contiguous sequences of n characters. For example, the text "for example" will be split into this sequence of tokens: "fo", "or", "r ", " e", "ex", "for", "or ex", etc. The length of the n-grams is variable and can be configured by the min_gram and max_gram params.
  • edge n-gram tokenizer. Works the same as the n-gram tokenizer except for one thing: this tokenizer doesn't increment the offset. For example, the text "for example" will be split into this sequence of tokens: "fo", "for", "for ", "for e", "for ex", "for exa", etc. More information about tokenizers can be found on the official site. Unfortunately, I can't post more links because of low reputation.
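To make the difference between the two n-gram flavors concrete, here is a small self-contained sketch in plain Java. It is not the actual Lucene tokenizers, just an approximation of the character n-grams they emit for a single token:

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {

    // All contiguous substrings with length between minGram and maxGram,
    // starting at every offset -- an approximation of the ngram tokenizer.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                tokens.add(text.substring(start, start + len));
            }
        }
        return tokens;
    }

    // Same idea, but every gram is anchored at offset 0 -- an
    // approximation of the edge_ngram tokenizer.
    static List<String> edgeNgrams(String text, int minGram, int maxGram) {
        List<String> tokens = new ArrayList<>();
        for (int len = minGram; len <= maxGram && len <= text.length(); len++) {
            tokens.add(text.substring(0, len));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("blue", 2, 4));
        // [bl, blu, blue, lu, lue, ue]
        System.out.println(edgeNgrams("blue", 2, 4));
        // [bl, blu, blue]
    }
}
```

Because the indexed value "BlueSaphire" would (after lowercasing) contribute grams such as "blue", a query analyzed with the keyword analyzer can hit it without any wildcard syntax, which is the whole idea behind the mapping shown later.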

The next stage is also damn interesting. After we split the text into tokens, we can do a lot of interesting things with them. Again, a few very useful examples of token filters:

  • lowercase filter. In most cases, we want a case-insensitive search, so it's good practice to bring tokens to lowercase.
  • stemmer filter. When we deal with natural language, we have a lot of problems. One of them is that one word can have many forms. The stemmer filter helps us get the root form of a word.
  • fuzziness filter. Another problem is that users often make typos. This filter adds tokens that contain possible typos.
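A token filter is conceptually just a function applied to each token in sequence. The toy sketch below (plain Java, illustrative only; the real English stemmer is far more sophisticated) chains a lowercase filter with a naive possessive stripper, roughly what the possessive_english filter from the settings file later does:

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class TokenFilterDemo {

    // Lowercase filter: normalize case so search is case-insensitive.
    static final UnaryOperator<String> LOWERCASE = t -> t.toLowerCase();

    // Toy possessive "stemmer": strip a trailing 's.
    static final UnaryOperator<String> POSSESSIVE =
            t -> t.endsWith("'s") ? t.substring(0, t.length() - 2) : t;

    // Run every token through the filter chain, in order.
    static List<String> applyFilters(List<String> tokens,
                                     List<UnaryOperator<String>> filters) {
        return tokens.stream()
                .map(t -> {
                    for (UnaryOperator<String> f : filters) {
                        t = f.apply(t);
                    }
                    return t;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Saphire", "Blue's", "BLUE");
        System.out.println(applyFilters(tokens, List.of(LOWERCASE, POSSESSIVE)));
        // [saphire, blue, blue]
    }
}
```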

If you are interested in looking at the result of the analysis, you can use the _termvectors endpoint:

curl [ELASTIC_URL]:9200/[INDEX_NAME]/[TYPE_NAME]/[DOCUMENT_ID]/_termvectors?pretty

Now let's talk about queries. Queries are divided into 2 large groups. These groups have 2 significant differences:

  1. Whether the request will go through the analysis stage or not;
  2. Whether the request requires an exact answer (yes or no).

Examples are the match query and the term query. The first passes through the analysis stage, the second does not. The first does not give us a specific answer (but gives us a score), the second does. When creating mappings for a document, we can specify both the index analyzer and the search analyzer separately for each field.
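A minimal simulation of that difference (plain Java, not real Elasticsearch): the "index" stores analyzed, lowercased tokens; a term-style lookup uses the query text verbatim, while a match-style lookup runs the query through the same analysis chain first:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class MatchVsTermDemo {

    // "Analysis": whitespace tokenizing + lowercase filter.
    static Set<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Tokens stored in the inverted index for the document "Saphire Blue".
        Set<String> indexed = analyze("Saphire Blue");

        // term-style query: the query text is NOT analyzed, so case matters.
        System.out.println(indexed.contains("Blue"));             // false

        // match-style query: the query goes through the same analysis chain.
        System.out.println(indexed.containsAll(analyze("Blue"))); // true
    }
}
```

This is why the original fuzzy match query found "BLUE" and "Blue Black" (both analyzed down to the token "blue") but not "BlueSaphire", which the standard tokenizer keeps as a single token.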

Now some information regarding Spring Data Elasticsearch. Here it makes sense to talk about concrete examples. Suppose that we have a document with a title field and we want to search for information in this field. First, create a file with settings for Elasticsearch:

{
  "analysis": {
    "analyzer": {
      "ngram_analyzer": {
        "tokenizer": "ngram_tokenizer",
        "filter": ["lowercase"]
      },
      "edge_ngram_analyzer": {
        "tokenizer": "edge_ngram_tokenizer",
        "filter": ["lowercase"]
      },
      "english_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "english_stop",
          "unique",
          "english_possessive_stemmer",
          "english_stemmer"
        ]
      },
      "keyword_analyzer": {
        "tokenizer": "keyword",
        "filter": ["lowercase"]
      }
    },
    "tokenizer": {
      "ngram_tokenizer": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 20
      },
      "edge_ngram_tokenizer": {
        "type": "edge_ngram",
        "min_gram": 2,
        "max_gram": 20
      }
    },
    "filter": {
      "english_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "english_stemmer": {
        "type": "stemmer",
        "language": "english"
      },
      "english_possessive_stemmer": {
        "type": "stemmer",
        "language": "possessive_english"
      }
    }
  }
}

You can save these settings to your resources folder. Now let's look at our document class:

@Document(indexName = "document", type = "document")
@Setting(settingPath = "document_index_setting.json")
public class Document {

    @Id
    private String id;

    @MultiField(
        mainField = @Field(type = FieldType.String,
                           index = FieldIndex.not_analyzed),
        otherFields = {
                @InnerField(suffix = "edge_ngram",
                        type = FieldType.String,
                        indexAnalyzer = "edge_ngram_analyzer",
                        searchAnalyzer = "keyword_analyzer"),
                @InnerField(suffix = "ngram",
                        type = FieldType.String,
                        indexAnalyzer = "ngram_analyzer",
                        searchAnalyzer = "keyword_analyzer"),
                @InnerField(suffix = "english",
                        type = FieldType.String,
                        indexAnalyzer = "english_analyzer")
        }
    )
    private String title;

    // getters and setters omitted

}

So here the field title has three inner fields:

  • title.edge_ngram for searching by edge n-grams with the keyword search analyzer. We need this because we don't want our query to be split into edge n-grams;
  • title.ngram for searching by n-grams;
  • title.english for searching with the nuances of a natural language.

And the main field title. We don't analyze this because sometimes we want to sort by this field. Let's use a simple multi match query for searching through all these fields:
String searchQuery = "blablabla";
MultiMatchQueryBuilder queryBuilder = multiMatchQuery(searchQuery)
    .field("title.edge_ngram", 2)
    .field("title.ngram")
    .field("title.english");
NativeSearchQueryBuilder searchBuilder = new NativeSearchQueryBuilder()
    .withIndices("document")
    .withTypes("document")
    .withQuery(queryBuilder)
    .withPageable(new PageRequest(page, pageSize));
elasticsearchTemplate.queryForPage(searchBuilder.build(),
                                   Document.class,
                                   new SearchResultMapper() {
                                       // realisation omitted
                                   });

Search is a very interesting and voluminous topic. I tried to answer as briefly as possible; it is possible that because of this there were confusing moments - do not hesitate to ask.

answered Jan 09 '23 by Nikita Klimov