I see that I can use a different tokenizer/analyzer for each language with the text_general field type.
But there is also text_en.
Why do we need two?
Suppose we have a sentence in an Asian language, and the sentence also contains some English words. Is text_general used for the Asian words in the sentence and text_en for the English words?
How would Solr index and query such sentences?
text_en uses stemming, so if you search for fakes you can also match fake, fake's, faking, etc. With a non-stemmed field, fakes will match only fakes.
Each field type uses a different "chain" of analysis components. text_en uses a chain of filters that handles English better. See the tokenizer and filter entries.
Schema excerpt for text_general:
<!-- A general text field that has reasonable, generic
     cross-language defaults: it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt" ... -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- rest of the field type omitted from this excerpt -->
</fieldType>
Schema excerpt for text_en:
<!-- A text field with defaults appropriate for English: it
     tokenizes with StandardTokenizer, removes English stop words
     (lang/stopwords_en.txt), down cases, protects words from protwords.txt, and
     finally applies Porter's stemming. The query time analyzer
     also applies synonyms from synonyms.txt. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- rest of the field type omitted from this excerpt -->
</fieldType>
Why do we need two?
So that you can analyze different content differently. Or you can even analyze the same content differently (with a copyField) if you want. This gives you more choices at query time about which field you want to query.
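As a rough sketch of the copyField idea, the same text could be indexed once with each field type. The field names body and body_en are hypothetical and only illustrate the pattern; the field type names are the ones from the example schema.

```xml
<!-- Hypothetical field names; adjust to your own schema. -->
<!-- Original content, analyzed without stemming. -->
<field name="body" type="text_general" indexed="true" stored="true"/>

<!-- Same content, analyzed with the English chain (stemming, possessives). -->
<field name="body_en" type="text_en" indexed="true" stored="false"/>

<!-- Copy the raw input of body into body_en at index time. -->
<copyField source="body" dest="body_en"/>
```

With a setup like this, a query on body_en:fakes goes through the stemming chain, while body:fakes matches only the unstemmed term, so you choose the behavior per query.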
text_general is used for the Asian words in the sentence and text_en for English words?
No, each field can only have one fieldType, just like a column in a database has a single type.
If you want to do different analysis for different languages within the same field, then you can see SmartChineseAnalyzer for an example.
Also see http://docs.lucidworks.com/display/LWEUG/Multilingual+Indexing+and+Search
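For the mixed Chinese/English case, one possible field type is sketched below. It assumes a Solr version that ships solr.HMMChineseTokenizerFactory (part of the Smart Chinese analysis module in the analysis-extras contrib, which must be on the classpath); older releases expose the Smart Chinese analyzer through different factory classes. The field type name text_zh_en is made up for this example.

```xml
<!-- Hypothetical field type name; requires the analysis-extras contrib. -->
<fieldType name="text_zh_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Segments Chinese text into words and passes Latin-script words through as tokens. -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stems the English tokens; Chinese tokens are left unchanged. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

This keeps a single field with a single fieldType, while the analysis chain itself handles both scripts.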