Solr / Sunspot - determine indexing language at runtime, dynamically choose analyzers

I would like to use Solr + Sunspot to index a bilingual FR-EN site. The issue: a Post model can be written in either French or English. I can determine the language of each post at runtime, but I also need Solr to index the post accordingly.

E.g., for French posts, I would need a French stemmer:

<filter class="solr.SnowballPorterFilterFactory" language="French"/>

What are my options? Can I change Solr analyzers at runtime? Can I make a set of analyzers for each language?

Vlad Zloteanu asked Dec 22 '10 11:12


2 Answers

This is a great question, and a feature that's being discussed for inclusion in Sunspot.

Sunspot uses dynamic field naming conventions to set up its schema. For example, here are two existing definitions for text fields:

<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
<dynamicField name="*_texts" stored="true" type="text" multiValued="true" indexed="true"/>

These correspond to the fieldType name="text" defined earlier in the schema.

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

You could add a similar definition for the different languages you'd like to index (as Mauricio also mentions), and then set up some new dynamicField definitions to use them.

1. A fieldType definition for a French text field

<fieldType name="text_fr" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stem last: the Snowball stemmer expects lowercased tokens -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

2. A dynamicField definition for the French text field

<dynamicField name="*_text_fr" stored="false" type="text_fr" multiValued="true" indexed="true"/>
<dynamicField name="*_texts_fr" stored="true" type="text_fr" multiValued="true" indexed="true"/>

3. Using the French text field in Sunspot

The latest Sunspot 1.2 (not quite released; use 1.2.rc4) supports an :as option, which lets you specify the Solr field name directly.

searchable do
  text :description, :as => 'description_text_fr'
end
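
Since the language is only known per record, one way to wire this up is to derive the Solr field name from the language code. What follows is a minimal sketch, not something Sunspot provides: localized_text_field and SUPPORTED_LANGUAGES are hypothetical names, and it assumes the schema also has an English counterpart (*_text_en) to the French dynamicField above.

```ruby
# Hypothetical helper, not part of Sunspot: maps an attribute and a language
# code to a field name matching an assumed *_text_<lang> dynamicField pattern.
SUPPORTED_LANGUAGES = %w[en fr].freeze

def localized_text_field(attribute, lang)
  lang = lang.to_s.downcase
  unless SUPPORTED_LANGUAGES.include?(lang)
    raise ArgumentError, "no text field type configured for #{lang.inspect}"
  end
  "#{attribute}_text_#{lang}"
end

# Possible use inside a model (untested sketch; assumes the model has a
# `language` attribute returning 'en' or 'fr'). Each record only supplies a
# value for the field matching its own language:
#
#   searchable do
#     text :description_fr, :as => localized_text_field(:description, 'fr') do
#       description if language == 'fr'
#     end
#     text :description_en, :as => localized_text_field(:description, 'en') do
#       description if language == 'en'
#     end
#   end
```

Raising on an unknown language is a deliberate choice here: silently falling back to the generic text field would index the content with the wrong analyzer.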

Like I said, this is something I'm thinking of adding to Sunspot 1.3 or 1.4. Personally, I'd like to see something like :lang => :en on a text field definition to choose the appropriate field definition. Do feel free to chime in on the Sunspot mailing list with your thoughts!

Nick Zadrozny answered Oct 06 '22 00:10


Can't say anything about Sunspot, but in pure Solr I'd create separate field types in your Solr schema (one fieldType for French, another for English), then create one field for English content (using the English fieldType) and another field for French content (using the French fieldType).

Since you know which language to use at runtime, you'd just pick one field or the other to run your searches and get results.
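
Picking the field at query time might look like this in plain Ruby. It is only a sketch under assumptions: the field names description_text_en and description_text_fr and the helper solr_select_query are hypothetical, and it relies on the dismax parser's qf parameter to restrict the search to a single field.

```ruby
require 'uri'

# Hypothetical mapping from language code to the Solr field that holds
# content in that language; the field names are assumptions.
LANGUAGE_FIELDS = {
  'en' => 'description_text_en',
  'fr' => 'description_text_fr'
}.freeze

# Builds the query string for a Solr select request that searches only the
# field analyzed for the given language.
def solr_select_query(terms, lang)
  field = LANGUAGE_FIELDS.fetch(lang.to_s) do
    raise ArgumentError, "unsupported language: #{lang.inspect}"
  end
  URI.encode_www_form('q' => terms, 'defType' => 'dismax', 'qf' => field)
end

puts solr_select_query('chevaux blancs', 'fr')
# prints "q=chevaux+blancs&defType=dismax&qf=description_text_fr"
```

The resulting string would be appended to the select URL of your Solr instance; because the query goes against the French field, the search terms pass through the same French analyzer as the indexed text.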

Mauricio Scheffer answered Oct 06 '22 00:10