Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

textual content without metadata from Tika via SolrCell

Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.

I would like to provide some snippet highlighting of the content and the subject metadata within the content field is skewing the highlight results.

UPDATE: Screenshot of Tika output as indexed by Solr. Highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.

solr screenshot of tika output

The ExtractingRequestHandler in solrconfig.xml:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    </lst>
</requestHandler>

Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
like image 865
Peaeater Avatar asked Feb 21 '23 06:02

Peaeater


1 Answers

As all other answers are completely irrelevant, I'll post mine:

I have experienced exactly the same problem as OP describes, (Solr 4.3.0, custom config, custom schema, etc. I'm not newbie or something and understand Solr internals pretty well)

This was my ERH config:

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">false</str>

      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

It was basically configured to ignore everything except the content (i believe it's reasonable for many people).

After careful investigation i found out, that

<str name="captureAttr">false</str>

was the thing caused OP's issue. By default it is turned on, but i turned it off as i did not need it anyway. And that was my mistake. I have no idea why, but it causes Solr to put extracted attributes into fmap.content field altogether with extracted text.

So the solution is to turn it back on. Final ERH:

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">true</str>

      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

Now, only extracted text is put to fmap.content field.

Unfortunately i have not found any piece of documentation which can explain this. Either bug or just stupid behavior

like image 192
illegal-immigrant Avatar answered Feb 25 '23 04:02

illegal-immigrant