Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

inputting arbitrary xml in solr

Tags:

xml

solr

I have a question on Apache Solr. If I have an arbitrary XML file, and a XSD that it conforms to, how do I input it into Solr. Could I get a code sample? I know you have to parse the XML and put the relevant data in a solr input doc, but I don't understand how to do that.

like image 642
John Doe Avatar asked Feb 20 '23 06:02

John Doe


2 Answers

The DataImportHandler (DIH) allows you to pass the incoming XML to an XSL, as well as parse and transform the XML with DIH transformers. You could translate your arbitrary XML to Solr's standard input XML format via XSL, or map/transform the arbitrary XML to the Solr schema fields right there in the DIH config file, or a combination of both. DIH is flexible.

Sample dih-config.xml

Here's a sample dih-config.xml from a an actual working site (no pseudo-samples here, my friend). Note that it picks up xml files from a local directory on the LAMP server. If you prefer to post xml files directly via HTTP you would need to configure a ContentStreamDataSource instead.

It so happens that the incoming xml is already in standard Solr update xml format in this sample, and all the XSL does is remove empty field nodes, while the real transforms, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume", and "ignored_seriesissue", are done with DIH Regex transformers. (The XSLT is performed first, and the output of that is then given to the DIH transformers.) The attribute "useSolrAddSchema" tells DIH that the xml is already in standard Solr xml format. If that were not the case, another attribute, "xpath", on the XPathEntityProcessor would be required to select content from the incoming xml document.

<dataConfig>
    <dataSource encoding="UTF-8" type="FileDataSource" />
    <document>
        <!--
            Pickupdir fetches all files matching the filename regex in the supplied directory
            and passes them to other entities which parse the file contents. 
        -->
        <entity
            name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/cci/import/"
            recursive="true"
            newerThan="${dataimporter.last_index_time}"
        >

        <!--
            Pickupxmlfile parses standard Solr update XML.
            Incoming values are split into multiple tokens when given a splitBy attribute.
            Dates are transformed into valid Solr dates when given a dateTimeFormat to parse.
        -->
        <entity 
            name="xml"
            processor="XPathEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer"
            datasource="pickupdir"
            stream="true"
            useSolrAddSchema="true"
            url="${pickupdir.fileAbsolutePath}"
            xsl="xslt/dih.xsl"
        >

            <field column="abstract_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="creator_facet" template="${xml.creator_t}" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
            <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
            <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
            <field column="ispartof_t" regex="\|" replaceWith=" " />
            <field column="language_t" splitBy="\|" />
            <field column="language_facet" template="${xml.language_t}" />
            <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
            <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
            <field column="location_display" regex="\|" replaceWith=" " />
            <field column="othertitles_display" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="responsibility_display" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
            <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
            <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
            <field column="src_facet" template="${xml.src}" />
            <field column="subject_t" splitBy="\|" />
            <field column="subject_facet" template="${xml.subject_t}" />
            <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
            <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
            <field column="title_sort" template="${xml.title_t}" />
            <field column="toc_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
            <field column="type_facet" template="${xml.type_t}" />
    </entity>
      </entity>
    </document>
</dataConfig>

To set up DIH:

  • Ensure the DIH jars are referenced from solrconfig.xml, as they are not included by default in the Solr WAR file. One easy way is to create a lib folder in the Solr instance directory that includes the DIH jars, as the solrconfig.xml looks in the lib folder for references by default. Find the DIH jars in the apache-solr-x.x.x/dist folder when you download the Solr package.

dist folder: solr dih jars location

  • Create your dih-config.xml (as above) in the Solr "conf" directory.

  • Add a DIH request handler to solrconfig.xml if it's not there already.

request handler:

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">dih-config.xml</str>
</lst>
</requestHandler>

To trigger DIH:

There is a lot more info re full-import vs. delta-import and whether to commit, optimize, etc. in the wiki description on Data Import Handler Commands, but the following would trigger the DIH operation without deleting the existing index first, and commit the changes after all the files had been processed. The sample given above would collect all the files found in the pickup directory, transform them, index them, and finally, commit the update/s to the index (which would make them searchable the instant commit was finished).

http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true
like image 68
Peaeater Avatar answered Mar 03 '23 21:03

Peaeater


the easiest way might be to use DataImportHandler, it allows you to apply a XSL first to transform your xml to Solr input xml

like image 36
Persimmonium Avatar answered Mar 03 '23 19:03

Persimmonium