
Defining nested entities in Solr Data Import Handler

Tags: solr

Let me preface by mentioning that I've been through everything I could find about this topic including the Solr docs and all of the SO questions.

I have a Solr instance that I've set up with a Data Import Handler to pull in data from MSSQL using the JDBC driver. The data comes in, but it isn't structured as I'd expect based on the Solr DIH documentation:

<document>
 <entity>
  <entity />
 </entity>
</document>

I've tried all the attributes, such as rootEntity and flatten, as well as using CachedSqlProvider. With multiValued="True", the result ends up as:

docs [
{
  recordId: '1234',
  name: 'whatever',
  subrows_col1: ['x','y','z'],
  subrows_col2: ['a','b','c']
}
]

What I'm looking for is:

docs [
{
  recordId: '1234',
  name: 'whatever',
  subrows: [
    {
      col1: 'x',
      col2: 'a'
    },
    {
      col1: 'y',
      col2: 'b'
    },
    {
      col1: 'z',
      col2: 'c'
    }
  ]
}
]

I've seen the block-join stuff, but I'm confused as to where it goes. I added

<add>
 <doc>
  <field />
  <doc>
   <field />
  </doc>
 </doc>
</add>

to the DIH requestHandler, but it did nothing. I added it to the /update requestHandler and got an error. I have no clue where it is supposed to go. Does it only work during a query, or is it only for when you push data to Solr via /update?

Where do I define the structure for the document? I tried nested fields in the schema, entities in the DIH config, and the block-join stuff in the requestHandlers. Nothing has worked yet.

Obviously I'm missing something.

Dustin Davis asked Dec 11 '22 04:12


1 Answer

Indexing nested documents in DIH is supported from Solr 5.1 onwards.

https://issues.apache.org/jira/browse/SOLR-5147

Simply add child="true" to the child entity, and Solr DIH will automatically index each of its rows as a child document.

Example taken from the JIRA issue linked above:

<document>
  <entity name='PARENT' query='select * from PARENT'>
    <field column='id' />
    <field column='desc' />
    <field column='type_s' />
    <entity child='true' name='CHILD' query="select * from CHILD where parent_id='${PARENT.id}'">
      <field column='id' />
      <field column='desc' />
      <field column='type_s' />
    </entity>
  </entity>
</document>

I also decompiled DocBuilder.class in solr-dataimporthandler-5.3.0.jar and found this code snippet:

if (doc != null) {
    if (epw.getEntity().isChild())
    {
        childDoc = new DocWrapper();
        handleSpecialCommands(arow, childDoc);
        addFields(epw.getEntity(), childDoc, arow, vr);
        doc.addChildDocument(childDoc);
    }
    else
    {
        handleSpecialCommands(arow, doc);
        addFields(epw.getEntity(), doc, arow, vr);
    }
}

Note that epw.getEntity().isChild() returns true when child="true" is set. In that case DocBuilder creates a new DocWrapper and adds it to the parent as a child document, instead of simply adding the entity's columns as more fields on the parent.
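To make the difference concrete, here is a minimal, self-contained Java sketch of that branch. It is not Solr code: the Doc class, addFields and build are hypothetical stand-ins that only mimic the shape of the logic above, using the recordId/col1/col2 data from the question. It suggests why the asker saw parallel multi-valued fields without child="true" and would get one nested document per row with it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedDocSketch {

    /** Hypothetical stand-in for a Solr document: multi-valued fields plus child documents. */
    static class Doc {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();
        final List<Doc> children = new ArrayList<>();

        void addField(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Copies every column of a row into the given document. */
    static void addFields(Doc target, Map<String, Object> row) {
        for (Map.Entry<String, Object> e : row.entrySet()) {
            target.addField(e.getKey(), e.getValue());
        }
    }

    /** Builds one parent document; isChild models the child="true" attribute (an assumption of this sketch). */
    static Doc build(Map<String, Object> parentRow,
                     List<Map<String, Object>> childRows,
                     boolean isChild) {
        Doc doc = new Doc();
        addFields(doc, parentRow);
        for (Map<String, Object> row : childRows) {
            if (isChild) {
                // Each sub-entity row becomes its own nested document.
                Doc childDoc = new Doc();
                addFields(childDoc, row);
                doc.children.add(childDoc);
            } else {
                // Sub-entity columns are flattened onto the parent as multi-valued fields.
                addFields(doc, row);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> parent = new LinkedHashMap<>();
        parent.put("recordId", "1234");
        parent.put("name", "whatever");

        List<Map<String, Object>> rows = new ArrayList<>();
        String[][] data = {{"x", "a"}, {"y", "b"}, {"z", "c"}};
        for (String[] d : data) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("col1", d[0]);
            row.put("col2", d[1]);
            rows.add(row);
        }

        Doc flat = build(parent, rows, false);
        Doc nested = build(parent, rows, true);

        // prints {recordId=[1234], name=[whatever], col1=[x, y, z], col2=[a, b, c]}
        System.out.println(flat.fields);
        // prints 3
        System.out.println(nested.children.size());
    }
}
```

In the flattened case the pairing between col1 and col2 survives only by list position, which is the structure the question complains about; in the nested case each pair stays together in its own child document.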

aheryan answered Feb 11 '23 10:02