
Hierarchical faceted search example with Solr

Question

Where can I find a complete example that shows how hierarchical faceted search works from indexing the documents to retrieving search results?

My research so far

Stack Overflow has a few posts, but each of them addresses only certain aspects of hierarchical faceted search, so I wouldn't consider them duplicates. I'm looking for a complete example to understand it. The piece I keep missing is the final query in which the aggregations work.

  • This would be pretty much exactly what I am looking for, but again, not a complete walkthrough: Solr Hierarchical Faceting. Example needed

There is documentation on the Solr webpage, but I didn't understand the example given there.

  • https://wiki.apache.org/solr/HierarchicalFaceting

Example (conceptually)

I'd like to create a complete walkthrough example here and hope you can provide the missing final piece.

Testdata

Input

Let's say we have 3 documents with each document being a person.

Alice (document 1)
 - Blond
 - Europe

Jane (document 2)
 - Brown
 - Europe/Norway

Bob (document 3)
 - Black
 - Europe/Norway
 - Europe/Sweden

Output

The expected output for this (currently wrong) query

http://server:8983/solr/my_core/select?q=*%3A*&wt=json&indent=true&facet=true&facet.field=tags_ss

should be

Hair_color (3)
- blond (1)
- brown (1)
- black (1)

Location (3)
- Europe (4)  // This should be 4 not 3, i.e. the sum of the leaves, because Alice is tagged with "Europe" only, without a country
  - Norway (2)
  - Sweden (1)

because all documents are found.

Example (programmatically)

This is where I require help. How do I implement the above conceptual example?

Here is how far I've gotten.

1. Create the test data XML

This is the content of the documents.xml file in the solr-5.1.0/testdata subfolder:

<add>
    <doc>
        <field name="id">Alice</field>
        <field name="tags_ss">hair_color/blond</field>
        <field name="tags_ss">location/Europe</field>
    </doc>
    <doc>
        <field name="id">Jane</field>
        <field name="tags_ss">hair_color/brown</field>
        <field name="tags_ss">location/Europe/Norway</field>
    </doc>
    <doc>
        <field name="id">Bob</field>
        <field name="tags_ss">hair_color/black</field>
        <field name="tags_ss">location/Europe/Norway</field>
        <field name="tags_ss">location/Europe/Sweden</field>
    </doc>
</add>

The _ss is defined in schema.xml as

<dynamicField name="*_ss" type="string"  indexed="true"  stored="true" multiValued="true"/>

Note that all tags, e.g. hair_color and location and any tags that will be added in the future, are stored in the same tags_ss field.

2. Index the test data with Solr

c:\solr-5.1.0>java -classpath dist/solr-core-5.1.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files -Drecursive=yes -Durl=http://server:8983/solr/my_core/update org.apache.solr.util.SimplePostTool .\testdata


3. Retrieve all data with a Solr query (without faceting)

Query

http://server:8983/solr/my_core/select?q=*%3A*&wt=json&indent=true

Result

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "indent": "true",
      "q": "*:*",
      "_": "1430830360536",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      {
        "id": "Alice",
        "tags_ss": [
          "hair_color/blond",
          "location/europe"
        ],
        "_version_": 1500334369469890600
      },
      {
        "id": "Jane",
        "tags_ss": [
          "hair_color/brown",
          "location/europe/Norway"
        ],
        "_version_": 1500334369469890600
      },
      {
        "id": "Bob",
        "tags_ss": [
          "hair_color/black",
          "location/europe/Norway",
          "location/europe/Sweden"
        ],
        "_version_": 1500334369469890600
      }
    ]
  }
}

4. Retrieve all data with a Solr query (with faceting)

Query

http://server:8983/solr/my_core/select?q=*%3A*&wt=json&indent=true&facet=true&facet.field=tags_ss

Result

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "facet": "true",
      "indent": "true",
      "q": "*:*",
      "_": "1430830432389",
      "facet.field": "tags_ss",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      {
        "id": "Alice",
        "tags_ss": [
          "hair_color/blond",
          "location/europe"
        ],
        "_version_": 1500334369469890600
      },
      {
        "id": "Jane",
        "tags_ss": [
          "hair_color/brown",
          "location/europe/Norway"
        ],
        "_version_": 1500334369469890600
      },
      {
        "id": "Bob",
        "tags_ss": [
          "hair_color/black",
          "location/europe/Norway",
          "location/europe/Sweden"
        ],
        "_version_": 1500334369469890600
      }
    ]
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "tags_ss": [
        "location/europe/Norway",
        2,
        "hair_color/black",
        1,
        "hair_color/blond",
        1,
        "hair_color/brown",
        1,
        "location/europe",
        1,
        "location/europe/Sweden",
        1
      ]
    },
    "facet_dates": {},
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {}
  }
}

Note this section at the bottom of the result:

"facet_fields": {
  "tags_ss": [
    "location/europe/Norway",
    2,
    "hair_color/black",
    1,
    "hair_color/blond",
    1,
    "hair_color/brown",
    1,
    "location/europe",
    1,
    "location/europe/Sweden",
    1
  ]
},

It shows all tags as a flat list (not hierarchical).

5. Retrieve all data with a Solr query (with hierarchical faceting)

Query

Here is my problem. I don't know how to construct the query which returns the following result (the result already shown in the conceptual example above).

Result (fictitious, created by hand for illustration)

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "facet":"true",
      "indent":"true",
      "q":"*:*",
      "facet.field":"tags_ss",
      "wt":"json",
      "rows":"0"}},
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "tags_ss":[
        "hair_color",3, // This aggregation is missing
        "hair_color/black",1,
        "hair_color/blond",1,
        "hair_color/brown",1,
        "location/europe",4, // This aggregation should be 4 but is 1
        "location/europe/Norway",2,
        "location/europe/Sweden",1]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

This tag list is still flat, but at least location/europe = 4 would be aggregated correctly. Currently it is not: I keep getting location/europe = 1, because that value is set only for Alice, and the Norway and Sweden tags of Jane and Bob are not aggregated to count towards Europe as well.

Ideas

  • I might need to use facet.pivot, but I don't know how.
  • I might need to use facet.prefix, but I don't know how.

Versions

  • Solr 5.1.0
  • Windows 7
Lernkurve asked May 05 '15 13:05



1 Answer

You can get all of your aggregations to be populated if you push them into the index in stages. If Bob is from Norway, you might populate up to three values in your facet field:

location
location/Europe
location/Europe/Norway

(As an alternate design, you might have a hair color field separate from the location field, and then "location" would never need to be populated in the field itself.)
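A minimal sketch of that staging step, in Python, run before the documents are sent to Solr. The function names expand_path and staged_tags are hypothetical helpers, not part of Solr, and the separator is assumed to be "/" as in the question:

```python
def expand_path(path, sep="/"):
    """Return the path plus every ancestor prefix, shortest first."""
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def staged_tags(tags, sep="/"):
    """Expand every tag of one document, dropping duplicates but keeping order."""
    seen = []
    for tag in tags:
        for prefix in expand_path(tag, sep):
            if prefix not in seen:
                seen.append(prefix)
    return seen


# Bob's tags from the question's documents.xml
bob = ["hair_color/black", "location/Europe/Norway", "location/Europe/Sweden"]
for value in staged_tags(bob):
    print(value)
```

For Bob this yields hair_color, hair_color/black, location, location/Europe, location/Europe/Norway and location/Europe/Sweden, each of which would become one value of tags_ss.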

Then your results are still flat but your aggregated totals are present. At that point, you will need to do some programmatic work with the result set to create a nested data structure built by splitting all of the values on your separator character (/ in this case). Once you have a nested data structure, then displaying it hierarchically should be manageable. It's hard to go into detail about this part of the implementation because your nested data structure and display will depend heavily on your development environment.
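As one possible illustration of that programmatic step, here is a Python sketch that splits the flat facet list on "/" and builds a nested dictionary. The helper names (pairs, build_tree, render) are hypothetical, and the sample counts were worked out by hand for the three test documents under the assumption that every ancestor value was indexed; they are not actual Solr output:

```python
def pairs(flat):
    """Turn Solr's [value, count, value, count, ...] list into (value, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))


def build_tree(flat, sep="/"):
    """Build a nested dict; each node holds a 'count' and a dict of 'children'."""
    root = {"count": None, "children": {}}
    for value, count in pairs(flat):
        node = root
        for part in value.split(sep):
            node = node["children"].setdefault(part, {"count": None, "children": {}})
        node["count"] = count
    return root["children"]


def render(children, depth=0):
    """Return indented 'name (count)' lines for display."""
    lines = []
    for name, node in children.items():
        lines.append("  " * depth + f"{name} ({node['count']})")
        lines.extend(render(node["children"], depth + 1))
    return lines


# Hand-computed facet list (assumption: ancestors were indexed for all three people)
flat = [
    "hair_color", 3, "hair_color/black", 1, "hair_color/blond", 1, "hair_color/brown", 1,
    "location", 3, "location/Europe", 3,
    "location/Europe/Norway", 2, "location/Europe/Sweden", 1,
]
tree = build_tree(flat)
print("\n".join(render(tree)))
```

Note that location/Europe comes out as 3 here, the number of distinct people, rather than the 4 obtained by summing the leaves.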

Another, somewhat risky, option that avoids adding repetitive entries to your Solr facet field is to add only the value you're using now (e.g. location/Europe/Norway), and to sum the leaf totals as you iterate through the facet list and build your nested data structure. The risk there is that if a person is genuinely associated with multiple countries in Europe, you might get an inflated total for the higher-level location/Europe. I have chosen in my own projects to populate the separate values, as above. Even though they seem redundant, the aggregate totals end up being more accurate.
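The inflation can be seen with the question's own test data. The Python sketch below only models the two ways of counting on plain dictionaries; it does not call Solr, and the helper name under is hypothetical:

```python
# Location tags of the three people from the question
docs = {
    "Alice": ["location/Europe"],
    "Jane": ["location/Europe/Norway"],
    "Bob": ["location/Europe/Norway", "location/Europe/Sweden"],
}
PREFIX = "location/Europe"


def under(tag, prefix, sep="/"):
    """True if the tag is the prefix itself or lies below it in the hierarchy."""
    return tag == prefix or tag.startswith(prefix + sep)


# Summing per-value counts: a person is counted once per matching tag
summed = sum(1 for tags in docs.values() for tag in tags if under(tag, PREFIX))

# Counting people: each person is counted at most once
distinct = sum(1 for tags in docs.values() if any(under(t, PREFIX) for t in tags))

print(summed, distinct)  # 4 3
```

Bob is counted twice in the summed total because he has two countries, which is why indexing location/Europe as its own value gives 3 instead of 4.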

(As usual in Solr, this is only one of quite a few ways of doing things. This model works best for systems with a manageable number of total leaves, where it makes sense to retrieve all of the facet values up front and not have to make additional drill-down queries.)

A pivoting option

Solr facet pivoting can return a hierarchically-structured result directly from Solr, but runs the risk of creating false connections between values in certain situations.

So, say you load your documents like this:

<add>
 <doc>
  <field name="id">Alice</field>
  <field name="continent">Europe</field>
 </doc>
 <doc>
  <field name="id">Jane</field>
  <field name="continent">Europe</field>
  <field name="country">Norway</field>
 </doc>
 <doc>
  <field name="id">Bob</field>
  <field name="continent">Europe</field>
  <field name="country">Norway</field>
  <field name="country">Sweden</field>
 </doc>
</add>

Now you perform a facet pivot query with facet.pivot.mincount=1&facet.pivot=continent,country. The results can be great so far:

"facet_pivot":{
 "continent,country":[{
  "field":"continent",
  "value":"Europe",
  "count":3,
  "pivot":[{
    "field":"country",
    "value":"Norway",
    "count":2},
   {
    "field":"country",
    "value":"Sweden",
    "count":1}]}]}
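For reference, a request URL carrying those pivot parameters could be assembled roughly as follows. This is only a sketch: the host and core name (server:8983, my_core) are taken from the question, and q, rows and wt are assumptions added so that only the facet section comes back:

```python
from urllib.parse import urlencode

# Host and core as used in the question; adjust to your own installation
BASE_URL = "http://server:8983/solr/my_core/select"

params = [
    ("q", "*:*"),
    ("rows", "0"),                         # assumption: only facets are needed
    ("wt", "json"),
    ("facet", "true"),
    ("facet.pivot", "continent,country"),  # parent field first, then child field
    ("facet.pivot.mincount", "1"),
]

url = BASE_URL + "?" + urlencode(params)
print(url)
```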

So far so good. The problem comes when you add a new person to the data:

<add>
 <doc>
  <field name="id">Susan</field>
  <field name="continent">Europe</field>
  <field name="country">Norway</field>
  <field name="continent">South America</field>
  <field name="country">Brazil</field>
 </doc>
</add>

Now Solr doesn't actually know that Norway is in Europe and Brazil is in South America, so you will begin to get facet counts for "Europe > Brazil" and for "South America > Norway".
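The effect can be reproduced without Solr. The Python sketch below is a simplified model of pivot counting (a document counts towards a continent/country pair whenever it holds both values); it is not Solr's implementation, and the documents are plain dictionaries mirroring the XML above:

```python
from collections import Counter
from itertools import product

docs = [
    {"id": "Alice", "continent": ["Europe"], "country": []},
    {"id": "Jane", "continent": ["Europe"], "country": ["Norway"]},
    {"id": "Bob", "continent": ["Europe"], "country": ["Norway", "Sweden"]},
    {"id": "Susan", "continent": ["Europe", "South America"],
     "country": ["Norway", "Brazil"]},
]

# Count each (continent, country) combination once per document that has both
pair_counts = Counter()
for doc in docs:
    for pair in product(set(doc["continent"]), set(doc["country"])):
        pair_counts[pair] += 1

for (continent, country), count in sorted(pair_counts.items()):
    print(f"{continent} > {country}: {count}")
```

Susan alone is enough to produce the pairs Europe > Brazil and South America > Norway, because nothing in her document ties each country to its own continent.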

The problem is resolvable if you add continent prefixes to all of your country values:

<add>
 <doc>
  <field name="id">Susan</field>
  <field name="continent">Europe</field>
  <field name="country">Europe/Norway</field>
  <field name="continent">South America</field>
  <field name="country">South America/Brazil</field>
 </doc>
</add>

This way you will still get the mismatched pivot values, but you can choose to block any country-level facet values that don't have a prefix matching their continent. For this to be an issue, a multivalued field in the pivot must have values associated with values appearing later in the same pivot. If you are not expecting to have multiple values for these fields in a single record or if your values don't have a strong association (i.e. specific parentage), pivot facets can be an ideal solution. But in some cases, the pivot facet's disassociation between values in the included fields can create a prohibitive mess.
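The blocking step could look something like this in Python. The function name drop_mismatched is hypothetical, and the sample pivot list is written by hand in the shape of the facet_pivot response shown earlier (counts assume all four people were indexed with prefixed country values); it is not real Solr output:

```python
def drop_mismatched(pivot, sep="/"):
    """Keep only child entries whose value starts with the parent's value plus sep."""
    cleaned = []
    for parent in pivot:
        prefix = parent["value"] + sep
        children = [c for c in parent.get("pivot", []) if c["value"].startswith(prefix)]
        cleaned.append({**parent, "pivot": children})
    return cleaned


# Hand-written illustration of a "continent,country" pivot list
pivot = [
    {"field": "continent", "value": "Europe", "count": 4, "pivot": [
        {"field": "country", "value": "Europe/Norway", "count": 3},
        {"field": "country", "value": "Europe/Sweden", "count": 1},
        {"field": "country", "value": "South America/Brazil", "count": 1},
    ]},
    {"field": "continent", "value": "South America", "count": 1, "pivot": [
        {"field": "country", "value": "Europe/Norway", "count": 1},
        {"field": "country", "value": "South America/Brazil", "count": 1},
    ]},
]

for parent in drop_mismatched(pivot):
    print(parent["value"], [c["value"] for c in parent["pivot"]])
```

After filtering, Europe keeps only Europe/Norway and Europe/Sweden, and South America keeps only South America/Brazil.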

frances answered Oct 20 '22 10:10