How to configure Haystack/Elasticsearch to handle contractions and apostrophes near the start of a word

I'm having a hell of a time dealing with apostrophes near the start or in the middle of words. I can handle English possessives, but I also need to cater for French and handle words like "d'action", where the apostrophe comes near the start of the word rather than near the end, as in "her's".

A search via Haystack's auto_query for "d action" returns results, but "d'action" returns nothing. If I query the Elasticsearch _search API directly (_search?q=D%27ACTION), I do get results for "d'action", so I am wondering whether this is an issue with the Haystack engine.

My configuration:

'settings': {
    "analysis": {
        "char_filter": {
            "quotes": {
                "type": "mapping",
                "mappings": [
                    "\\u0091=>\\u0027",
                    "\\u0092=>\\u0027",
                    "\\u2018=>\\u0027",
                    "\\u2019=>\\u0027",
                    "\\u201B=>\\u0027"
                ]
            }
        },
        "analyzer": {
            "ch_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ['ch_en_possessive_word_delimiter', 'ch_fr_stemmer'],
                "char_filter": ['html_strip', 'quotes'],
            },
        },

        "filter": {
            "ch_fr_stemmer" : {
                "type": "snowball",
                "language": "French"
            },
            "ch_en_possessive_word_delimiter": {
                "type": "word_delimiter",
                "stem_english_possessive": True
            }
        }
    }
}
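
To make the effect of the "quotes" char_filter concrete, here is a rough Python sketch of the same mapping applied outside Elasticsearch. The names QUOTE_MAP and normalize_quotes are made up for illustration and are not part of Haystack or Elasticsearch; the real filter runs inside the analyzer, before tokenization.

```python
# Hypothetical stand-in for the "quotes" mapping char_filter above:
# every listed quote-like character is rewritten to a plain apostrophe (U+0027).
QUOTE_MAP = str.maketrans({
    "\u0091": "'",
    "\u0092": "'",
    "\u2018": "'",
    "\u2019": "'",
    "\u201b": "'",
})


def normalize_quotes(text):
    """Return text with curly/typographic single quotes replaced by '."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes("d\u2019action"))  # d'action
```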

I have also subclassed ElasticsearchSearchBackend and BaseEngine so I can add the above configuration:

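# Imports assumed for this snippet (not shown in the original post).
import haystack
from django.conf import settings
from haystack.backends import BaseEngine
from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend,
    ElasticsearchSearchQuery,
)


class ConfigurableESBackend(ElasticsearchSearchBackend):
    # Words reserved by Elasticsearch for special use.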
    RESERVED_WORDS = (
        'AND',
        'NOT',
        'OR',
        'TO',
    )

    # Characters reserved by Elasticsearch for special use.
    # The '\\' must come first, so as not to overwrite the other slash replacements.
    RESERVED_CHARACTERS = (
        '\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}',
        '[', ']', '^', '"', '~', '*', '?', ':',
    )

    def setup(self):
        """
        Defers loading until needed.
        """
        # Get the existing mapping & cache it. We'll compare it
        # during the ``update`` & if it doesn't match, we'll put the new
        # mapping.
        try:
            self.existing_mapping = self.conn.get_mapping(index=self.index_name)
        except Exception:
            if not self.silently_fail:
                raise

        unified_index = haystack.connections[self.connection_alias].get_unified_index()
        self.content_field_name, field_mapping = self.build_schema(unified_index.all_searchfields())
        current_mapping = {
            'modelresult': {
                'properties': field_mapping,
                '_boost': {
                    'name': 'boost',
                    'null_value': 1.0
                }
            }
        }

        if current_mapping != self.existing_mapping:
            try:
                # Make sure the index is there first.
                self.conn.create_index(self.index_name, settings.ELASTICSEARCH_INDEX_SETTINGS)
                self.conn.put_mapping(self.index_name, 'modelresult', mapping=current_mapping)
                self.existing_mapping = current_mapping
            except Exception:
                if not self.silently_fail:
                    raise

        self.setup_complete = True

class CHElasticsearchSearchEngine(BaseEngine):
    backend = ConfigurableESBackend
    query = ElasticsearchSearchQuery
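
For context, a minimal sketch of how the settings side of this might look. The module path myproject.search_backends is hypothetical, the URL and index name are placeholders, and the analysis dict is abbreviated. ENGINE, URL and INDEX_NAME are the usual Haystack connection keys, and ELASTICSEARCH_INDEX_SETTINGS is the name that setup() above reads from settings.

```python
# settings.py (sketch; module path, URL and index name are placeholders,
# not taken from the original post)
ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            # char_filter / analyzer / filter definitions
            # from the configuration shown earlier go here
        }
    }
}

HAYSTACK_CONNECTIONS = {
    "default": {
        # Points Haystack at the custom engine class defined above.
        "ENGINE": "myproject.search_backends.CHElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    }
}
```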
Asked by arc, Sep 04 '14 at 14:09


1 Answer

OK, so this had nothing to do with the configuration; it was an issue with the .txt template used for Haystack indexing.

I had:

{{ object.some_model.name_en }}
{{ object.some_model.name_fr }}

This caused characters like ' to be converted to HTML entities (such as &#39;) by Django's template autoescaping, so the search never found the result. Using the "safe" filter fixed the issue:

{{ object.some_model.name_en|safe }}
{{ object.some_model.name_fr|safe }}
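
The effect is easy to reproduce outside Django. As a rough illustration, Python's standard-library html.escape performs the same kind of substitution as template autoescaping (the exact entity differs by tool and version: &#39; in some, &#x27; in others), so the text that gets indexed no longer contains a literal apostrophe. The sample value "plan d'action" is made up for the example.

```python
import html

# Illustrative field value, standing in for object.some_model.name_fr.
name_fr = "plan d'action"

# Roughly what an auto-escaped template renders into the index document.
escaped = html.escape(name_fr, quote=True)
print(escaped)  # plan d&#x27;action

# The literal query term is gone from the escaped text...
print("d'action" in escaped)  # False

# ...but present in the unescaped text, which is what |safe preserves.
print("d'action" in name_fr)  # True
```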
Answered by arc, Sep 21 '22 at 01:09