
Difference between text_general and text_en in Solr?

I see that I can configure a different tokenizer/analyzer per language for a text_general field.
But there is also text_en.

Why do we need two?

Suppose we have a sentence in an Asian language that also contains some English words.
Is text_general used for the Asian words in the sentence and text_en for the English words?
How would Solr index and query such a sentence?

eugene asked Jun 07 '13 01:06


2 Answers

text_en uses stemming, so a search for fakes also matches fake, fake's, faking, etc. On a non-stemmed field, fakes matches only fakes.

Each field type defines its own analysis chain: a tokenizer followed by a series of filters. text_en uses a chain of filters that is better suited to indexing English. Compare the tokenizer and filter entries in the two excerpts.

Schema excerpt for text_general:

<!-- A general text field that has reasonable, generic
     cross-language defaults: it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt",
     and down cases. -->

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Schema excerpt for text_en:

<!-- A text field with defaults appropriate for English: it
     tokenizes with StandardTokenizer, removes English stop words
     (lang/stopwords_en.txt), down cases, protects words from protwords.txt, and
     finally applies Porter's stemming.  The query time analyzer
     also applies synonyms from synonyms.txt. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- the query-time analyzer described in the comment above is not shown in this excerpt -->
</fieldType>
Jesvin Jose answered Nov 13 '22 19:11


Why do we need two?

So that you can analyze different content differently. You can even analyze the same content in more than one way by copying it into a second field with copyField. This gives you more choice at query time about which field to query.
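
As a rough illustration of the copyField approach, the schema fragment below indexes the same text twice, once with text_general and once with text_en. The field names title and title_en are hypothetical and not taken from the question; only the element and attribute names are standard Solr schema syntax.

```xml
<!-- Hypothetical field names, for illustration only. -->
<!-- "title" keeps the unstemmed tokens produced by text_general. -->
<field name="title"    type="text_general" indexed="true" stored="true"/>
<!-- "title_en" receives the same raw text but analyzes it with text_en (stemming). -->
<field name="title_en" type="text_en"      indexed="true" stored="false"/>

<!-- copyField copies the original input value, before any analysis is applied. -->
<copyField source="title" dest="title_en"/>
```

With a setup like this, a query on title:fakes would match only the exact form, while title_en:fakes would also match documents containing fake or faking.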

Is text_general used for the Asian words in the sentence and text_en for the English words?

No. Each field has exactly one fieldType, just as a column in a database has exactly one type.

If you want different analysis for different languages within the same field, see SmartChineseAnalyzer for an example; it analyzes Simplified Chinese as well as mixed Chinese-English text.

Also see http://docs.lucidworks.com/display/LWEUG/Multilingual+Indexing+and+Search
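
For the mixed-language sentence in the question, one possible setup is a single field type whose chain copes with both scripts. The sketch below is an assumption, not something either answer prescribes: the type name text_cjk_mixed and the field name body are hypothetical, and it swaps in CJK bigram indexing rather than SmartChineseAnalyzer. The factories themselves are standard Solr analysis components.

```xml
<!-- Hypothetical type name; goes alongside the other fieldType definitions. -->
<fieldType name="text_cjk_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer keeps Latin-script words whole and emits
         Han ideographs as single-character tokens. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalizes full-width Latin and half-width Katakana forms. -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lowercases the English words; CJK characters are unaffected. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Joins adjacent CJK characters into overlapping bigrams;
         non-CJK tokens pass through unchanged. -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above; goes alongside the other field definitions. -->
<field name="body" type="text_cjk_mixed" indexed="true" stored="true"/>
```

Because the same analyzer runs at index and query time, a mixed-language query is tokenized the same way as the stored sentence. Note that this chain does not stem the English words; whether to add a stemmer or to copy the text into a separate text_en field depends on how much English matching matters.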

arun answered Nov 13 '22 20:11