
How do I exclude HTML content from my Elasticsearch index?

I'm using Elasticsearch, and writing my own wrapper using WebRequest since NEST (the usual choice) bafflingly seems to lack the ability to insert an item and have the generated ID returned.

Anyway, no problems with the general method. But any HTML content is indexed as-is, i.e. if I have <strong>test</strong> in a field, then a search for the query "strong" returns the item.

I've put this in elasticsearch.yml, based on a random message board post I found:

index:
    analysis:
        analyzer:
            htmlContentAnalyzer:
                type: custom
                tokenizer: standard
                filter: standard
                char_filter: html_strip

Then I create a mapping like this for my index 'content', item type 'news':

PUT http://localhost:9200/content/news/_mapping

{
    "news" : {
        "properties" : {
            "TextContent" : {
                "type" : "string",
                "index" : "analyzed",
                "analyzer" : "htmlContentAnalyzer",
                "store" : "yes"
            }
        }
    }
}

The "store" : "yes" is just for "fun"; it makes no difference. The above gives me a 200 OK.
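
For anyone rolling their own HTTP wrapper as described, here is a rough sketch of preparing that PUT request in Python with only the standard library. The URL and mapping body are the ones from the question; the helper name `build_put_mapping_request` is hypothetical, and the request is only built, not sent. Building the body as a dict and serializing it also avoids hand-balancing braces.

```python
import json
import urllib.request

# URL taken from the question; assumes a local node on the default port.
MAPPING_URL = "http://localhost:9200/content/news/_mapping"

# Same mapping as in the question, expressed as a Python dict.
mapping = {
    "news": {
        "properties": {
            "TextContent": {
                "type": "string",
                "index": "analyzed",
                "analyzer": "htmlContentAnalyzer",
                "store": "yes",
            }
        }
    }
}


def build_put_mapping_request(url: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a PUT request carrying the mapping as JSON.

    Hypothetical helper for illustration only.
    """
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_put_mapping_request(MAPPING_URL, mapping)

# To actually send it, a server must be listening on localhost:9200:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```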

However, a search for "strong" still returns the item, same as before.

What doesn't help is that the Elasticsearch documentation seems appalling. Check out this page:

http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping.html

it gives you a brief rundown of what mapping is, and says more details are in the mapping section, i.e. this page:

http://www.elasticsearch.org/guide/reference/mapping/

...which seems to be truly terrible. There's nothing referring to the format/object graph I found - no mention of "properties", "type", "analyzer", "index" etc. There are some sections on the menu on the right, e.g. "_index", but they seem to refer to the item as a whole? And where is that pointed out?

So my question is on two fronts:

  • How do I stop HTML tags (and entities, attribute values I guess) being indexed? - I still want the HTML stored, mind you
  • Is there a better source for Elasticsearch info/documentation? Or am I looking at it without the super-secret decoder glasses?

Kieren Johnstone asked Oct 22 '22 07:10


1 Answer

With all credit to chrismale on #elasticsearch (freenode IRC) -

Searching against _all is no good: that field is indexed with its own analyzer, so the html_strip char filter in htmlContentAnalyzer never gets applied to it. Querying on my TextContent field specifically worked as expected.
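
As a rough illustration of the difference, here is a Python sketch that builds the two search bodies. It assumes an Elasticsearch version of that era, where a `query_string` query with no field falls back to `_all`; the host, index `content`, and type `news` come from the question, and the helper names are hypothetical. Nothing is sent to a server here.

```python
import json

BASE_URL = "http://localhost:9200"  # host from the question


def search_url(index: str, doc_type: str) -> str:
    """Build the _search endpoint for an index and type (hypothetical helper)."""
    return f"{BASE_URL}/{index}/{doc_type}/_search"


def all_fields_query(text: str) -> dict:
    """query_string with no field given.

    Assumption: the default field is _all, which has its own analyzer,
    so HTML tag names in the source text remain searchable.
    """
    return {"query": {"query_string": {"query": text}}}


def field_query(field: str, text: str) -> dict:
    """query_string restricted to one field, so that field's analyzer is used."""
    return {"query": {"query_string": {"default_field": field, "query": text}}}


if __name__ == "__main__":
    print(search_url("content", "news"))
    # Goes against _all: may still match the "strong" tag name.
    print(json.dumps(all_fields_query("strong")))
    # Goes against TextContent, which uses htmlContentAnalyzer.
    print(json.dumps(field_query("TextContent", "strong")))
```

Either body would be POSTed to the `_search` URL; only the second one is scoped to the field that the custom analyzer was mapped onto.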

Kieren Johnstone answered Oct 31 '22 18:10