Solr: transform a comma-delimited field during data import

Tags: solr

I'm working with Solr 3.5.0. I am importing from a JDBC data source and have a delimited field that I would like split into individual values. I'm using the RegexTransformer but my field isn't being split.

sample value

Bob,Carol,Ted,Alice

data-config.xml

<dataConfig>
  <dataSource driver="..." />
  <document>
    <entity name="ent"
            query="SELECT id,names FROM blah"
            transformer="RegexTransformer">
      <field column="id" />
      <field column="names" splitBy="," />
    </entity>
  </document>
</dataConfig>

schema.xml

<schema name="mytest" version="1.0">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
               omitNorms="true"/>
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
  </types>
  <fields>
    <field name="id" type="integer" indexed="false" stored="true"
           multiValued="false" required="true" />
    <field name="name" type="string" indexed="true" stored="true"
           multiValued="true" required="true" />
  </fields>
</schema>

When I search, I get a result doc element like this:

<doc>
  <int name="id">22</int>
  <arr name="names">
    <str>Bob,Carol,Ted,Alice</str>
  </arr>
</doc>

I was hoping to get this instead:

<doc>
  <int name="id">22</int>
  <arr name="names">
    <str>Bob</str>
    <str>Carol</str>
    <str>Ted</str>
    <str>Alice</str>
  </arr>
</doc>

It's quite possible I misunderstand the RegexTransformer section of the wiki. I've tried changing my delimiter and I've tried using a different field for the parts (as shown in the wiki)...

<field column="name" splitBy="," sourceColName="names" />

...but that resulted in an empty name field. What am I doing wrong?

Paul asked Dec 02 '22 23:12


2 Answers

I handled a similar issue by creating a fieldtype in the schema file:

<fieldType name="commaDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*" />
  </analyzer>
</fieldType>

Then I applied that type to the data field like this:

<field name="features" type="commaDelimited" indexed="true" stored="true"/>
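
Adapted to the schema in the question, this approach might look like the fragment below. This is only a sketch and not part of the original answer: it assumes the target field keeps the name `name` from the question's schema.xml and simply swaps its type. An analyzer only changes the indexed terms, so the stored value returned in search results would still be the original comma-delimited string.

```xml
<types>
  <!-- Splits the incoming text on a comma followed by optional whitespace -->
  <fieldType name="commaDelimited" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*" />
    </analyzer>
  </fieldType>
</types>
<fields>
  <!-- Assumption: "name" is the field from the question's schema.xml -->
  <field name="name" type="commaDelimited" indexed="true" stored="true" />
</fields>
```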
dhysong answered Dec 04 '22 13:12


Your database column is called names, while the Solr field is called name (note the missing s). One solution is to use the following in your DIH config and then re-index:

<field name="name" column="names" splitBy=","/>
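
For context, here is a sketch of how that line could fit into the data-config.xml from the question. Everything except the changed `names` field line is copied from the question, including the elided `driver="..."` value. In a DIH field, `column` names the source column returned by the query and `name` names the target Solr field.

```xml
<dataConfig>
  <dataSource driver="..." />
  <document>
    <entity name="ent"
            query="SELECT id,names FROM blah"
            transformer="RegexTransformer">
      <field column="id" />
      <!-- column = database column; name = field defined in schema.xml -->
      <field name="name" column="names" splitBy="," />
    </entity>
  </document>
</dataConfig>
```

After re-indexing, the split values should be stored in the multiValued `name` field rather than in a field called `names`.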
nikhil500 answered Dec 04 '22 12:12