Django haystack EdgeNgramField gives different results than elasticsearch

Question

I'm currently running haystack with an elasticsearch backend, and I'm building an autocomplete for city names. The problem is that SearchQuerySet gives me different results, which look wrong to me, than the same query executed directly in elasticsearch, which returns the results I expect.

I'm using: Django 1.5.4, django-haystack 2.1.0, pyelasticsearch 0.6.1, elasticsearch 0.90.3

Using the following example data:

Midfield
Midland City
Midway
Minor
Minturn
Miami Beach

Using either

SearchQuerySet().models(Geoname).filter(name_auto='mid')
or
SearchQuerySet().models(Geoname).autocomplete(name_auto='mid')

Both always return all 6 names, including Min* and Mia*. However, querying elasticsearch directly returns the right data:

"query": {
    "filtered" : {
        "query" : {
            "match_all": {}
        },
        "filter" : {
             "term": {"name_auto": "mid"}
        }
    }
}

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": [
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075977",
            "_score": 1,
            "_source": {
               "name_auto": "Midfield"
            }
         },
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075984",
            "_score": 1,
            "_source": {
               "name_auto": "Midland City"
            }
         },
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075989",
            "_score": 1,
            "_source": {
               "name_auto": "Midway"
            }
         }
      ]
   }
}

The behavior is the same with other examples. My guess is that, through haystack, the query string is being split and analyzed into all possible "min_gram" groups of characters, and that's why it returns wrong results.

I'm not sure whether I'm doing or understanding something wrong, or whether this is how haystack is supposed to work, but I need the haystack results to match the elasticsearch results.

So, how can I fix the issue or make it work?

My summarized objects look as follows:

Model:

class Geoname(models.Model):
    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=255)

Index:

class GeonameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name_auto = indexes.EdgeNgramField(model_attr='name')

    def get_model(self):
        return Geoname

Mapping:

modelresult: {
    _boost: {
        name: "boost",
        null_value: 1
    },
    properties: {
        django_ct: {
            type: "string"
        },
        django_id: {
            type: "string"
        },
        name_auto: {
            type: "string",
            store: true,
            term_vector: "with_positions_offsets",
            analyzer: "edgengram_analyzer"
        }
    }
}

Thank you.

tufla · Accepted Answer

After a deep look into the code I found that the search generated by haystack was:

{
  "query":{
     "filtered":{
        "filter":{
           "fquery":{
              "query":{
                 "query_string":{
                    "query": "django_ct:(csi.geoname)"
                 }
              },
              "_cache":false
           }
        },
        "query":{
           "query_string":{
              "query": "name_auto:(mid)",
              "default_operator":"or",
              "default_field":"text",
              "auto_generate_phrase_queries":true,
              "analyze_wildcard":true
           }
        }
     }
  },
  "from":0,
  "size":6
}

Running this query in elasticsearch gave me the same 6 objects that haystack was showing, but if I added the following to the "query_string":

"analyzer": "standard"

it worked as desired. So the idea was to be able to set up a different search analyzer for the field.
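To see why the analyzer makes the difference, here is a rough pure-Python model of the two analysis chains. It is only a sketch and not Elasticsearch's actual implementation: the tokenizer is approximated with a regular expression, the "standard" analyzer is reduced to whole lowercased words, and the function names are made up for illustration. The min_gram and max_gram values are the ones used for the edge n-gram filter in this post.

```python
import re

CITIES = ["Midfield", "Midland City", "Midway", "Minor", "Minturn", "Miami Beach"]


def lowercase_tokenize(text):
    """Approximate the 'lowercase' tokenizer: split on non-letters, lowercase."""
    return re.findall(r"[a-z]+", text.lower())


def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the prefixes of token whose length is within [min_gram, max_gram]."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def edgengram_analyze(text):
    """Model of edgengram_analyzer: every word becomes its set of prefixes."""
    terms = set()
    for token in lowercase_tokenize(text):
        terms.update(edge_ngrams(token))
    return terms


def standard_analyze(text):
    """Simplified model of the standard analyzer: whole lowercased words only."""
    return set(lowercase_tokenize(text))


# Documents are always indexed with the edge n-gram analyzer.
INDEX = {city: edgengram_analyze(city) for city in CITIES}


def search(query, query_analyzer):
    """OR semantics: a document matches if it shares any term with the query."""
    query_terms = query_analyzer(query)
    return [city for city in CITIES if INDEX[city] & query_terms]


# Query analyzed with the field's own analyzer: "mid" becomes {"mi", "mid"},
# and "mi" is a prefix of every city in the sample data.
same_analyzer_hits = search("mid", edgengram_analyze)

# Query analyzed as a whole word: only the term "mid" is looked up.
standard_hits = search("mid", standard_analyze)

print(same_analyzer_hits)
print(standard_hits)
```

In this model the first search matches all six cities because the query itself is expanded into the prefix "mi", while the second matches only the three Mid* cities, which mirrors the behavior described above.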

Based on the link in @user954994's answer and the explanation in this post, what I finally did to make it work was:

I created a custom elasticsearch backend, adding a new custom analyzer based on the standard one.
I added a custom EdgeNgramField that makes it possible to set one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).

So, my new settings are:

ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                },
                "suggest_analyzer": {
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "standard",
                        "lowercase",
                        "asciifolding"
                    ]
                },
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    }
}

My new custom build_schema method looks as follows:

def build_schema(self, fields):
    content_field_name, mapping = super(CustomElasticBackend,
                                        self).build_schema(fields)

    for field_name, field_class in fields.items():
        field_mapping = mapping[field_class.index_fieldname]

        index_analyzer = getattr(field_class, 'index_analyzer', None)
        search_analyzer = getattr(field_class, 'search_analyzer', None)
        field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)

        if field_mapping['type'] == 'string' and field_class.indexed:
            if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                field_mapping['analyzer'] = field_analyzer

        if index_analyzer and search_analyzer:
            field_mapping['index_analyzer'] = index_analyzer
            field_mapping['search_analyzer'] = search_analyzer
            # pop() avoids a KeyError when the field has no 'analyzer' entry
            field_mapping.pop('analyzer', None)

        mapping.update({field_class.index_fieldname: field_mapping})
    return (content_field_name, mapping)

After rebuilding the index, my mapping looks like this:

modelresult: {
   _boost: {
       name: "boost",
       null_value: 1
   },
   properties: {
       django_ct: {
           type: "string"
       },
       django_id: {
           type: "string"
       },
       name_auto: {
           type: "string",
           store: true,
           term_vector: "with_positions_offsets",
           index_analyzer: "edgengram_analyzer",
           search_analyzer: "suggest_analyzer"
       }
   }
}

Now everything is working as expected!
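The analyzer override in build_schema boils down to a small dictionary transformation, which can be tried in isolation. The following is a minimal sketch, not part of haystack: split_analyzers is a hypothetical helper name, and the input dictionary is copied from the name_auto mapping shown in the question.

```python
def split_analyzers(field_mapping, index_analyzer=None, search_analyzer=None):
    """Return a copy of field_mapping using separate index/search analyzers.

    The single 'analyzer' key is dropped only when both replacements are given,
    mirroring the condition used in build_schema.
    """
    updated = dict(field_mapping)
    if index_analyzer and search_analyzer:
        updated["index_analyzer"] = index_analyzer
        updated["search_analyzer"] = search_analyzer
        # pop() with a default does not fail if 'analyzer' was never set
        updated.pop("analyzer", None)
    return updated


before = {
    "type": "string",
    "store": True,
    "term_vector": "with_positions_offsets",
    "analyzer": "edgengram_analyzer",
}

after = split_analyzers(before, "edgengram_analyzer", "suggest_analyzer")
unchanged = split_analyzers(before)

print(after)
```

Working on a copy keeps the original mapping intact, which makes the before/after comparison easy to check.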

UPDATE:

Below you'll find the code to clarify this part:

I created a custom elasticsearch backend, adding a new custom analyzer based on the standard one.

I added a custom EdgeNgramField that makes it possible to set one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).

In my app's search_backends.py:

from django.conf import settings
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend
from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine
from haystack.fields import EdgeNgramField as BaseEdgeNgramField


# Custom Backend 
class CustomElasticBackend(ElasticsearchSearchBackend):

    DEFAULT_ANALYZER = None

    def __init__(self, connection_alias, **connection_options):
        super(CustomElasticBackend, self).__init__(
                                connection_alias, **connection_options)
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        self.DEFAULT_ANALYZER = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', "snowball")
        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)

    def build_schema(self, fields):
        content_field_name, mapping = super(CustomElasticBackend,
                                              self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            index_analyzer = getattr(field_class, 'index_analyzer', None)
            search_analyzer = getattr(field_class, 'search_analyzer', None)
            field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)

            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = field_analyzer

            if index_analyzer and search_analyzer:
                field_mapping['index_analyzer'] = index_analyzer
                field_mapping['search_analyzer'] = search_analyzer
                # pop() avoids a KeyError when the field has no 'analyzer' entry
                field_mapping.pop('analyzer', None)

            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)


class CustomElasticSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticBackend


# Custom field
class CustomFieldMixin(object):

    def __init__(self, **kwargs):
        self.analyzer = kwargs.pop('analyzer', None)
        self.index_analyzer = kwargs.pop('index_analyzer', None)
        self.search_analyzer = kwargs.pop('search_analyzer', None)
        super(CustomFieldMixin, self).__init__(**kwargs)


class CustomEdgeNgramField(CustomFieldMixin, BaseEdgeNgramField):
    pass
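
The mixin works because it pops its extra keyword arguments before delegating to the parent constructor, which would otherwise reject names it does not know. Here is a minimal sketch of that pattern. StandInField is a hypothetical stand-in for the haystack field base class so the example runs without haystack installed, and it assumes the real field constructor likewise accepts only a fixed set of keyword arguments.

```python
class StandInField(object):
    """Hypothetical stand-in for a haystack field; accepts only known kwargs."""

    def __init__(self, model_attr=None):
        self.model_attr = model_attr


class AnalyzerMixin(object):
    """Consume analyzer kwargs, then pass the remaining ones up the MRO."""

    def __init__(self, **kwargs):
        self.index_analyzer = kwargs.pop("index_analyzer", None)
        self.search_analyzer = kwargs.pop("search_analyzer", None)
        super(AnalyzerMixin, self).__init__(**kwargs)


class StandInEdgeNgramField(AnalyzerMixin, StandInField):
    # The mixin must come first so its __init__ runs before the field's.
    pass


field = StandInEdgeNgramField(
    model_attr="name",
    index_analyzer="edgengram_analyzer",
    search_analyzer="suggest_analyzer",
)

# Without the mixin, the unknown keyword argument is rejected.
try:
    StandInField(model_attr="name", search_analyzer="suggest_analyzer")
    rejected = False
except TypeError:
    rejected = True

print(field.model_attr, field.index_analyzer, field.search_analyzer, rejected)
```

The attributes stored on the field instance are what build_schema later reads with getattr.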

My index definition goes something like:

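from haystack import indexes
from my_app.search_backends import CustomEdgeNgramField


class MyIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    # Index with edge n-grams, but analyze the search text as whole words
    name_auto = CustomEdgeNgramField(model_attr='name',
                                     index_analyzer="edgengram_analyzer",
                                     search_analyzer="suggest_analyzer")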

And finally, the settings use the custom backend in the haystack connection definition:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'my_app.search_backends.CustomElasticSearchEngine',
        'URL': 'http://localhost:9200',
        'INDEX_NAME': 'index'
    },
}
