
Elasticsearch aggregate on URL hostname

I am indexing documents with a field containing a url:

[
    'myUrlField' => 'http://google.com/foo/bar'
]

Now what I'd like to get out of elasticsearch is an aggregation on the url field.

curl -XGET 'http://localhost:9200/myIndex/_search?pretty' -d '{
  "facets": {
    "groupByMyUrlField": {
      "terms": {
        "field": "myUrlField"
      }
    }
  }
}'

This is all well and good, but the default analyzer tokenizes the field so that each part of the url becomes its own token, and I get hits for http, google.com, foo and bar. Really I am only interested in the hostname of the url, the google.com.
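
For reference, running such a value through the standard analyzer shows those tokens (output trimmed):

curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d 'http://google.com/foo/bar'

which returns the tokens http, google.com, foo and bar.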

Can I use facets to group by a specific token?

"field": "myUrlField.0"

or something like that?

Aggregating on a "not_analyzed" version of the field is also no good, because then I would get one bucket per unique url rather than per hostname.

Would love to be able to do this in elasticsearch and not in my client code. Thanks

asked May 26 '14 by user1777136


1 Answer

Here is a way to aggregate urls by domains:

First you tokenize the full url as a single token using the keyword tokenizer (which behaves like not_analyzed under the hood), then you extract the domain with a regex using a pattern_capture token filter. Finally, the original full-url token is discarded by setting the preserve_original option to false.

Which leads to:

{
  "settings": {
    "analysis": {
      "filter": {
        "capture_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "flags": "CASE_INSENSITIVE",
          "patterns": [
            "https?:\/\/([^/]+)"
          ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_domain_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "weblink": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "domain_analyzer"
        }
      }
    }
  }
}
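
To try this out end to end, the settings above can be applied when creating the index and a handful of documents indexed. The index name url_analyzer, the settings.json file (holding the JSON above) and the sample urls are only assumptions, picked so that the counts match the aggregation output further down:

curl -XPUT "http://localhost:9200/url_analyzer" -d @settings.json
curl -XPUT "http://localhost:9200/url_analyzer/weblink/1" -d '{"url": "http://en.wikipedia.org/wiki/Wikipedia"}'
curl -XPUT "http://localhost:9200/url_analyzer/weblink/2" -d '{"url": "http://en.wikipedia.org/wiki/Elasticsearch"}'
curl -XPUT "http://localhost:9200/url_analyzer/weblink/3" -d '{"url": "http://www.elasticsearch.org/guide"}'
curl -XPOST "http://localhost:9200/url_analyzer/_refresh"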

We check how our urls are tokenized:

curl -sXGET http://localhost:9200/url_analyzer/_analyze\?analyzer\=domain_analyzer\&pretty -d 'http://en.wikipedia.org/wiki/Wikipedia' | grep token
  "tokens" : [ {
    "token" : "en.wikipedia.org",

This looks good. Now let's aggregate our urls by domain using the newer aggregations feature (which will deprecate facets in the near future).

curl -XGET "http://localhost:9200/url_analyzer/_search?pretty" -d'
{
  "aggregations": {
    "tokens": {
      "terms": {
        "field": "url"
      }
    }
  }
}'

Output:

"aggregations" : {
    "tokens" : {
      "buckets" : [ {
        "key" : "en.wikipedia.org",
        "doc_count" : 2
      }, {
        "key" : "www.elasticsearch.org",
        "doc_count" : 1
      } ]
    }
  }

From here you can go further and apply an additional shingle token filter on top of this to also match queries such as "en.wikipedia" or "wikipedia.org", if you don't want to require an exact match on the full domain when searching.
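
A rough sketch of that idea (untested, the filter names and shingle sizes are just assumptions; capture_domain_filter stays as defined above): split the captured domain on its dots with a word_delimiter filter, then recombine the parts with a shingle filter that uses "." as the token separator.

"filter": {
  "capture_domain_filter": { ... },
  "domain_parts_filter": {
    "type": "word_delimiter"
  },
  "domain_shingle_filter": {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 3,
    "output_unigrams": true,
    "token_separator": "."
  }
},
"analyzer": {
  "domain_analyzer": {
    "type": "custom",
    "tokenizer": "keyword",
    "filter": [ "capture_domain_filter", "domain_parts_filter", "domain_shingle_filter" ]
  }
}

With this chain, "en.wikipedia.org" would be indexed as en, wikipedia, org, en.wikipedia, wikipedia.org and en.wikipedia.org. Keep in mind that if the same analyzer also feeds the terms aggregation, each partial domain becomes its own bucket.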

answered Nov 20 '22 by Adrien Schuler