
SOLR not searching on certain fields

Tags: indexing, solr

I just installed Solr, edited schema.xml, and am now trying to index some test data and search on it.

In the XML file I'm sending to Solr, one of my fields looks like this:

<field name="PageContent"><![CDATA[<p>some text in a paragrah tag</p>]]></field>

There's HTML there, so I've wrapped it in CDATA.

In my Solr schema.xml, the definition for that field looks like this:

<field name="PageContent" type="text" indexed="true" stored="true"/>

When I ran the POSTing tool, everything went ok, but when I search for content which I know is inside the PageContent field, I get no results.

However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent.

Am I doing something wrong? What's the issue?


To clarify on the error:

I've uploaded a "doc" with the following data:

<field name="PageID">928</field>
<field name="PageName">some name</field>
<field name="PageContent"><![CDATA[<p>html content</p>]]></field>

In my schema I've defined the fields as such:

<field name="PageID" type="integer" indexed="true" stored="true" required="true"/>
<field name="PageName" type="text" indexed="true" stored="true"/>
<field name="PageContent" type="text" indexed="true" stored="true"/>

And:

<uniqueKey>PageID</uniqueKey>
<defaultSearchField>PageName</defaultSearchField>

Now, when I use the Solr admin tool and search for "some name", I get a result. But if I search for "html content", "html", "content" or "928", I get no results.

Why?

andy asked Nov 11 '09 05:11


2 Answers

You mentioned that your default search field is set to PageName, so I wouldn't expect a search for "content" to return anything: a query term with no field prefix is only matched against the default search field.

You probably meant to put "PageContent:content" in the search box to find data in that field. If you want to search against multiple fields, the DisMax request handler (http://wiki.apache.org/solr/DisMaxRequestHandler) is worth checking out. The Solr admin console isn't a great tool for experimenting with the DisMax search options; it's easier to edit the query URL directly.

Regardless, I agree with the previous poster: if your analysis chain isn't set up properly to deal with HTML, you are likely to get all sorts of unexpected search results. Strip the HTML out and index text only.

If you want your standard query handler to search against all your fields, you can change it in your solrconfig.xml (I always add a second query handler instead of modifying "standard"). The qf parameter is the list of fields you want to search against, separated by spaces.

<requestHandler name="standard" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
        <str name="echoParams">all</str>
        <str name="hl">true</str>
        <str name="fl">*</str>
        <str name="qf">PageName PageContent</str>
    </lst>
</requestHandler>
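
For the "second handler" approach mentioned above, a separate DisMax handler could sit alongside "standard". This is only a sketch: the handler name "pagesearch" is made up for illustration, and the qf value simply reuses the two text fields from the question.

```xml
<!-- Hypothetical second handler; "pagesearch" is an illustrative name, not part of the original config -->
<requestHandler name="pagesearch" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
        <!-- Fields that unqualified query terms are matched against -->
        <str name="qf">PageName PageContent</str>
    </lst>
</requestHandler>
```

In Solr versions of that era, a handler like this is selected per request with the qt parameter (for example, by adding &qt=pagesearch to the select URL), which leaves the "standard" handler untouched.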
Trey answered Sep 28 '22 08:09


You are making sure that your data has been committed before you attempt to search on it, right?
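
If a missing commit turns out to be the problem, one way to issue it by hand is to POST an XML commit message to the update handler. The message itself is standard Solr update XML; the URL in the comment assumes a default local install and may differ in your setup.

```xml
<!-- POST this body to the update handler, e.g. http://localhost:8983/solr/update (assumed default URL) -->
<commit/>
```

Documents that have been added but not yet committed are not visible to searches.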

Also, if your field contains raw HTML, it's probably best to strip the markup before indexing. You can do this in your application or with Solr's solr.HTMLStripWhitespaceTokenizerFactory, like this:

<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/> 

You declare this in the fieldtype definition for "text". Alternatively, you might want to create a new field type just for your HTML, something like text_html, and use it like so:

<fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100"> 
      <analyzer type="index"> 
          <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/> 
          <filter class="solr.StopFilterFactory" ignoreCase="true"/> 
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/> 
          <filter class="solr.LowerCaseFilterFactory"/> 
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> 
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
      </analyzer> 
      <analyzer type="query"> 
          <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/> 
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> 
          <filter class="solr.StopFilterFactory" ignoreCase="true"/> 
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/> 
          <filter class="solr.LowerCaseFilterFactory"/> 
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> 
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
      </analyzer> 
</fieldtype>
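
To actually apply the new type, the field definition from the question would need to point at it. A minimal sketch, assuming the text_html type above has been added to schema.xml:

```xml
<!-- PageContent now uses the HTML-stripping type instead of "text" -->
<field name="PageContent" type="text_html" indexed="true" stored="true"/>
```

Analysis runs at index time, so documents indexed before the change have to be re-posted for it to take effect. The stored value still holds the original HTML; only the indexed tokens are stripped.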

I am not sure what you mean by:

However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent.

Can you please elaborate?

Cody Caughlan answered Sep 28 '22 08:09