I would like to use Solr + Sunspot to index a bilingual FR-EN site. The issue: a Post record can be written in either French or English. I can determine the language at runtime, but I also need Solr to index each record accordingly.
E.g., for French posts I would need a French stemmer:
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
What are my options? Can I change Solr analyzers at runtime? Can I make a set of analyzers for each language?
This is a great question, and a feature that's being discussed for inclusion in Sunspot.
Sunspot uses dynamic field naming conventions to set up its schema. For example, here are two existing definitions for text fields:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
<dynamicField name="*_texts" stored="true" type="text" multiValued="true" indexed="true"/>
These correspond to the fieldType name="text" defined earlier in the schema:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
You could add a similar definition for each language you'd like to index (as Mauricio also mentions), and then set up some new dynamicField definitions to use them.
fieldType definition for a French text field:
<fieldType name="text_fr" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- Snowball stemmers expect lowercased tokens, so the stemmer runs last -->
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
</fieldType>
dynamicField definitions for the French text field (note type="text_fr", so these fields use the French analyzer):
<dynamicField name="*_text_fr" stored="false" type="text_fr" multiValued="true" indexed="true"/>
<dynamicField name="*_texts_fr" stored="true" type="text_fr" multiValued="true" indexed="true"/>
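To make the naming convention concrete, here is a rough Ruby sketch of how a field name ends up with a field type. It is a simplified model of Solr's dynamicField matching (suffix match only), not Solr's actual implementation, and it assumes the French dynamicField entries point at the text_fr type. DYNAMIC_FIELDS and field_type_for are names made up for this illustration.

```ruby
# Simplified model: a dynamicField pattern with a leading "*" matches any
# field name that ends with the rest of the pattern.
# (Hypothetical table mirroring the schema entries discussed above.)
DYNAMIC_FIELDS = {
  '*_text'     => 'text',
  '*_texts'    => 'text',
  '*_text_fr'  => 'text_fr',
  '*_texts_fr' => 'text_fr'
}.freeze

# Returns the field type name for a Solr field name, or nil if no pattern matches.
# The four suffixes above never match the same name, so the first hit is enough.
def field_type_for(field_name)
  _pattern, type = DYNAMIC_FIELDS.find do |pattern, _type|
    field_name.end_with?(pattern[1..-1]) # drop the leading "*"
  end
  type
end

puts field_type_for('description_text_fr').inspect
puts field_type_for('description_text').inspect
```

Because description_text_fr does not end in _text, it is not picked up by the stock *_text pattern, which is why the language suffix is a safe way to route a field to its own analyzer.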
The latest Sunspot 1.2 (not quite released; use 1.2.rc4) supports an :as option which lets you specify the field name.
searchable do
  text :description, :as => 'description_text_fr'
end
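Since the language is only known per record, one possible approach is to declare one text field per language and have each post fill in only the field that matches its own language. The sketch below shows that routing in plain Ruby so it runs without Sunspot. The Post struct, its language attribute, SUPPORTED_LANGUAGES, solr_text_field and description_for are all assumptions for illustration, and the English variant assumes you have also added a *_text_en dynamicField.

```ruby
# Hypothetical stand-in for the Post model; `language` is assumed to be 'fr' or 'en'.
Post = Struct.new(:description, :language)

SUPPORTED_LANGUAGES = %w[fr en].freeze

# Solr field name for an attribute and language, e.g. "description_text_fr".
def solr_text_field(attribute, language)
  "#{attribute}_text_#{language}"
end

# Returns the description only when the post is written in the given language,
# so each post populates exactly one of the per-language fields.
def description_for(post, language)
  post.description if post.language == language.to_s
end

# With Sunspot loaded, the helpers might be wired up roughly like this
# (untested sketch, combining the :as option with a block-valued field):
#
#   searchable do
#     SUPPORTED_LANGUAGES.each do |lang|
#       text :"description_#{lang}", :as => solr_text_field(:description, lang) do
#         description_for(self, lang)
#       end
#     end
#   end

post = Post.new('Bonjour le monde', 'fr')
SUPPORTED_LANGUAGES.each do |lang|
  puts "#{solr_text_field(:description, lang)}: #{description_for(post, lang).inspect}"
end
```

The upside of this layout is that nothing in Solr has to change at runtime: the analyzers are fixed in the schema, and the application only decides which field receives the text.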
Like I said, this is something I'm thinking of adding to Sunspot 1.3 or 1.4. Personally, I'd like to see something like :lang => :en on a text field definition to choose the appropriate field definition. Do feel free to chime in on the Sunspot mailing list with your thoughts!
Can't say anything about Sunspot, but in pure Solr I'd create separate field types in your Solr schema (one fieldType for French, another for English), then create one field for English content (using the English fieldType) and another field for French content (using the French fieldType).
Since you know which language to use at runtime, you'd just pick one field or the other to run your searches and get results.
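As a rough illustration of picking the field at query time, here is a small Ruby sketch that builds the request parameters for a dismax query restricted to one language's field. The field names description_text_fr and description_text_en, the SEARCH_FIELDS table and the search_params helper are assumptions for this example; q, defType and qf are standard Solr request parameters.

```ruby
require 'uri'

# Hypothetical mapping from language code to the Solr field that holds
# content analyzed for that language.
SEARCH_FIELDS = {
  'fr' => 'description_text_fr',
  'en' => 'description_text_en'
}.freeze

# Builds Solr request parameters that restrict a dismax query to the field
# for the given language. Raises KeyError for a language with no field.
def search_params(query, language)
  {
    'q'       => query,
    'defType' => 'dismax',
    'qf'      => SEARCH_FIELDS.fetch(language.to_s)
  }
end

# Query string to append to your Solr select URL.
puts URI.encode_www_form(search_params('chevaux', :fr))
```

Using fetch rather than a silent fallback means an unsupported language fails loudly instead of quietly searching a field analyzed for the wrong language.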