
Solr Indexing Splitting Field On Delimiter

Tags: xml, solr

I am trying to set up a Solr index with some data. However, I would like to send one of my fields as a pipe-delimited string and have it split on the Solr end, e.g.

<add>
 <doc>
  <field name="cat">a|b|c</field>
 </doc>
</add>

For a multi-valued field declared as

<field name="cat" type="str_split_on_pipe" indexed="true" stored="true" multiValued="true" omitNorms="true" />

And the split-on-pipe field type is

<fieldType name="str_split_on_pipe" class="solr.TextField" positionIncrementGap="100" >
  <analyzer type="index">
      <tokenizer class="solr.PatternTokenizerFactory" pattern="\|\s*" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <!--<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>-->
      <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
     possible with WordDelimiterFilter in conjunction with stemming. -->
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="solr.PatternTokenizerFactory" pattern="\|\s*" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <!--<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>-->
      <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
     possible with WordDelimiterFilter in conjunction with stemming. -->
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I would expect this to behave the same as if I had sent the document with three separate cat fields. However, it doesn't seem to do much, and queries just keep returning my pipe-separated list.

Is what I am trying to do possible, and if so where have I gone wrong?

Thanks, Amar

amarsuperstar asked Mar 31 '11 15:03


1 Answer

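Using a PatternTokenizer changes only the internal (indexed) representation, not the stored value. The individual tokens become searchable, but the stored value returned in results is still the original pipe-separated string. If you want Solr to treat it as a multi-valued field with multiple displayable values, you need to send three separate cat fields.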
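As a rough sketch of what that looks like, the update message from the question could be rewritten with one cat element per value. Only the cat field comes from the question; any other fields your schema requires are assumed and left out here.

```xml
<add>
  <doc>
    <!-- other fields of the document (for example a unique key) would go here -->
    <!-- one <field> element per value; Solr stores each as a separate value of the multiValued field -->
    <field name="cat">a</field>
    <field name="cat">b</field>
    <field name="cat">c</field>
  </doc>
</add>
```

With the values sent this way, the field could probably be declared with a plain string type, since the splitting no longer has to happen in the analyzer.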

If you are using DataImportHandler, then you can use the RegexTransformer to split the data.
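A minimal sketch of such a DataImportHandler configuration is shown below, assuming the data comes from a JDBC source. The driver, URL, entity name, table, and query are hypothetical placeholders; the parts that matter are transformer="RegexTransformer" on the entity and the splitBy attribute on the field.

```xml
<dataConfig>
  <!-- hypothetical data source; replace driver, url, and user with your own -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/path/to/db" user="sa" />
  <document>
    <!-- RegexTransformer is applied to every row returned by this (assumed) query -->
    <entity name="item" transformer="RegexTransformer" query="SELECT id, cat FROM item">
      <field column="id" name="id" />
      <!-- splitBy is a regular expression, so the pipe must be escaped;
           each resulting piece becomes one value of the multiValued cat field -->
      <field column="cat" splitBy="\|\s*" />
    </entity>
  </document>
</dataConfig>
```

The splitBy pattern reuses the regular expression from the tokenizer in the question, so a value such as a|b|c should be imported as three separate cat values.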

nikhil500 answered Oct 06 '22 15:10