
Indexing a comma-separated value field in Elasticsearch

I'm using Nutch to crawl a site and index it into Elasticsearch. My site has meta tags, some of which contain a comma-separated list of IDs that I intend to use for search. For example:

contentTypeIds="2,5,15". (note: no square brackets).

When ES indexes this, searching for contentTypeIds:5 does not find documents whose contentTypeIds contain 5; the query returns only the documents whose contentTypeIds is exactly "5". I want it to match every document whose list contains 5.

In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.

I'm new to ES, so I probably missed something. Thanks for your help!

asked Jun 30 '15 16:06 by Yann


1 Answer

Create a custom analyzer that splits the indexed text into tokens at each comma.

You can then search on the individual values. If you don't care about relevance scoring, you can use a filter instead of a query; my example searches with a term filter.

Below is how to do this with the Sense plugin. The example uses a field named contentType; substitute your own field name (contentTypeIds in your case).

DELETE testindex

PUT testindex
{
    "index" : {
        "analysis" : {
            "tokenizer" : {
                "comma" : {
                    "type" : "pattern",
                    "pattern" : ","
                }
            },
            "analyzer" : {
                "comma" : {
                    "type" : "custom",
                    "tokenizer" : "comma"
                }
            }
        }
    }
}

PUT /testindex/_mapping/yourtype
{
    "properties" : {
        "contentType" : {
            "type" : "string",
            "analyzer" : "comma"
        }
    }
}

PUT /testindex/yourtype/1
{
    "contentType" : "1,2,3"
}

PUT /testindex/yourtype/2
{
    "contentType" : "3,4"
}

PUT /testindex/yourtype/3
{
    "contentType" : "1,6"
}

GET /testindex/_search
{
    "query": {"match_all": {}}
}

GET /testindex/_search
{
    "filter": {
        "term": {
            "contentType": "6"
        }
    }
}
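If it helps to see why the last request matches only document 3, here is a rough Python sketch of the idea. It is not Elasticsearch code: comma_tokens and term_lookup are hypothetical helpers, and the sketch assumes the pattern tokenizer simply splits on the comma and skips empty tokens. It builds a toy inverted index from the three sample documents and does an exact token lookup, which is roughly what a term filter does.

```python
import re
from collections import defaultdict

# Hypothetical helper (assumption): imitates a pattern tokenizer whose
# pattern is ",". Splits the raw value on commas and skips empty tokens.
def comma_tokens(value):
    return [tok for tok in re.split(",", value) if tok]

# The three sample documents indexed above (id -> contentType value).
docs = {1: "1,2,3", 2: "3,4", 3: "1,6"}

# Toy inverted index: token -> set of ids of documents containing it.
inverted = defaultdict(set)
for doc_id, value in docs.items():
    for tok in comma_tokens(value):
        inverted[tok].add(doc_id)

# Hypothetical helper: exact match of the term against the stored tokens.
# The search term itself is not split or analyzed.
def term_lookup(term):
    return sorted(inverted.get(term, set()))

print(term_lookup("6"))    # [3]
print(term_lookup("1"))    # [1, 3]
print(term_lookup("1,6"))  # [] because no single token equals "1,6"
```

Without the comma analyzer, the whole string would have to match as one token, which is the behaviour described in the question.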

Hope it helps.

answered Oct 17 '22 21:10 by Rob