Indexing and Querying URLs in Solr

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not include www), I am looking for the correct way to index and query URLs. I've tried a few things, and I think I'm close, but I'm not sure why it doesn't work.

Here is my custom field type:

 <fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For example:

http://www.twitter.com/AndersonCooper, when indexed, will have the following words in different positions: http, www, twitter, com, andersoncooper

If I search for just twitter.com/andersoncooper, I would like the query to match the record that was indexed, which is why I also use the WordDelimiterFilter (WDF) to split the search query. However, the search query ends up looking like this:

myfield:("twitter com andersoncooper"), when I really want it to match all records that contain all of the following separate words: twitter com andersoncooper

Is there a different query filter or tokenizer I should be using?

KidA78 asked Jan 13 '11 18:01




1 Answer

If I understand this statement from your question correctly:

myfield:("twitter com andersoncooper"), when I really want it to match all records that contain all of the following separate words: twitter com andersoncooper

you are trying to write a query that would match both:

http://www.twitter.com/AndersonCooper

and

http://www.andersoncooper.com/socialmedia/twitter

(both links contain all of the tokens), but not match either

http://www.facebook.com/AndersonCooper 

or

http://www.twitter.com/AliceCooper
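
The expected matches above can be checked with a rough sketch. This is only an approximation of the analyzer chain from the question (split on non-alphanumeric characters, then lowercase), not Solr's actual WordDelimiterFilterFactory logic, and the helper names `tokenize` and `matches_all` are made up for illustration:

```python
import re

def tokenize(url):
    """Rough stand-in for the index analyzer: split on non-alphanumerics, lowercase."""
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", url) if t}

def matches_all(url, terms):
    """True when every query term appears among the URL's tokens (AND semantics)."""
    tokens = tokenize(url)
    return all(term.lower() in tokens for term in terms)

urls = [
    "http://www.twitter.com/AndersonCooper",
    "http://www.andersoncooper.com/socialmedia/twitter",
    "http://www.facebook.com/AndersonCooper",
    "http://www.twitter.com/AliceCooper",
]
terms = ["twitter", "com", "andersoncooper"]

# Map each URL to whether it contains all three terms, in any order.
results = {url: matches_all(url, terms) for url in urls}
for url, hit in results.items():
    print(hit, url)
```

Under these assumptions the first two URLs match and the last two do not, because term order and position are ignored and only the presence of every term matters.
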

If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and querying via curl or some other URL-based mechanism, the query parameter needs to look like this:

&q=myField:andersoncooper AND myField:twitter AND myField:com

One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be explicitly specified above. Alternatively, to save some space, you can change the default query operator to "AND" like this:

&q.op=AND&q=myField:(andersoncooper twitter com)
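
When sending either form over HTTP, the spaces, colons, and parentheses need to be URL-encoded. Here is a minimal sketch of building both request URLs with Python's standard library. The host, port, and core name (`localhost:8983`, `mycore`) are placeholders, not details from the question, and the field name must match the schema exactly, since Solr field names are case-sensitive:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute your own host and core name.
BASE = "http://localhost:8983/solr/mycore/select"

def build_url(params):
    """Return the select URL with the parameters percent-encoded."""
    return BASE + "?" + urlencode(params)

# Form 1: explicit AND between every term.
explicit = build_url(
    {"q": "myField:andersoncooper AND myField:twitter AND myField:com"}
)

# Form 2: default operator switched to AND via q.op.
compact = build_url(
    {"q.op": "AND", "q": "myField:(andersoncooper twitter com)"}
)

print(explicit)
print(compact)
```

Either URL can then be passed to curl or any HTTP client.
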

Gus answered Oct 10 '22 04:10