I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not include www), I am looking for the correct way to index and query URLs. I've tried a few things, and I think I'm close, but I'm not sure why it doesn't work.
Here is my custom field type:
<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
For example:
When indexed, http://www.twitter.com/AndersonCooper produces the following tokens at different positions: http, www, twitter, com, andersoncooper
If I search for just twitter.com/andersoncooper, I would like the query to match the indexed record, which is why I also use the WordDelimiterFilter (WDF) to split the search query. However, the search query ends up looking like this:
myfield:("twitter com andersoncooper") when I really want it to match all records that have all of the following separate words: twitter com andersoncooper
Is there a different query filter or tokenizer I should be using?
If I understand this statement from your question correctly:
myfield:("twitter com andersoncooper") when I really want it to match all records that have all of the following separate words: twitter com andersoncooper
you are trying to write a query that would match both:
http://www.twitter.com/AndersonCooper
and
http://www.andersoncooper.com/socialmedia/twitter
(both links contain all of the tokens), but not match either
http://www.facebook.com/AndersonCooper
or
http://www.twitter.com/AliceCooper
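To make the intended matching concrete, here is a rough Python sketch of the "all tokens must be present" semantics. The `tokenize` helper is only an approximation of the field type above (it lowercases and splits on anything that is not a letter or digit); it is not Solr's actual analysis chain, and both function names are made up for illustration.

```python
import re

def tokenize(text):
    # Approximation only: lowercase, then split into runs of letters or
    # runs of digits, roughly what the WDF + LowerCaseFilter chain yields.
    return set(re.findall(r"[a-z]+|[0-9]+", text.lower()))

def matches_all(query, url):
    # AND semantics: every query token must appear somewhere in the URL,
    # regardless of position or order.
    return tokenize(query) <= tokenize(url)

query = "twitter.com/andersoncooper"
urls = [
    "http://www.twitter.com/AndersonCooper",
    "http://www.andersoncooper.com/socialmedia/twitter",
    "http://www.facebook.com/AndersonCooper",
    "http://www.twitter.com/AliceCooper",
]
for url in urls:
    print(url, matches_all(query, url))
```

Under this approximation the first two URLs match and the last two do not.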
If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and you are querying via curl or some other URL-based mechanism, you need the query parameter to look like this:
&q=myField:andersoncooper AND myField:twitter AND myField:com
One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be explicitly specified above. Alternatively, to save some space, you can change the default query operator to "AND" like this:
&q.op=AND&q=myField:(andersoncooper twitter com)
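If you are building these requests from code, something like the following Python sketch could produce both forms from the raw user input. It assumes the field is called `myField` as above; `split_terms`, `explicit_and_query`, and `default_op_params` are hypothetical helper names, and the splitting is done client-side as a simple stand-in for what the query analyzer would do. The parameters still need to be URL-encoded before being appended to your select URL.

```python
import re
from urllib.parse import urlencode

def split_terms(user_input):
    # Client-side approximation of the query analyzer: lowercase and split
    # on anything that is not a letter or a digit.
    return re.findall(r"[a-z]+|[0-9]+", user_input.lower())

def explicit_and_query(field, user_input):
    # Form 1: spell out AND between the individual field:term clauses.
    return " AND ".join(f"{field}:{term}" for term in split_terms(user_input))

def default_op_params(field, user_input):
    # Form 2: set q.op=AND and group the terms under one field prefix.
    terms = " ".join(split_terms(user_input))
    return {"q.op": "AND", "q": f"{field}:({terms})"}

print(explicit_and_query("myField", "twitter.com/andersoncooper"))
print(urlencode(default_op_params("myField", "twitter.com/andersoncooper")))
```

The second form stays shorter as the number of terms grows, which is the main reason to prefer `q.op=AND` here.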