Indexing and Querying URLs in Solr

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not include www, for example), I am looking for the correct way to index and query URLs. I've tried a few things and I think I'm close, but I'm not sure why it doesn't work.

Here is my custom field type:

 <fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For example:

http://www.twitter.com/AndersonCooper when indexed, will have the following words in different positions: http,www,twitter,com,andersoncooper
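To make the token list above concrete, here is a rough Python sketch of what the index-time analysis does to that URL. It is only an approximation for illustration, not Solr's implementation: the function name `approximate_url_tokens` is hypothetical, it simply splits into runs of letters or digits and lowercases them, and it leaves out the extra unsplit token that preserveOriginal="1" keeps at index time.

```python
import re


def approximate_url_tokens(url):
    """Rough stand-in for WordDelimiterFilter + LowerCaseFilter (assumption, not Solr code)."""
    # Runs of letters or runs of digits become tokens; every other
    # character (":", "/", ".") acts as a delimiter. Case changes do not
    # split, mirroring splitOnCaseChange="0".
    parts = re.findall(r"[A-Za-z]+|[0-9]+", url)
    return [part.lower() for part in parts]


tokens = approximate_url_tokens("http://www.twitter.com/AndersonCooper")
print(tokens)  # ['http', 'www', 'twitter', 'com', 'andersoncooper']
```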

If I search for just twitter.com/andersoncooper, I would like the query to match the record that was indexed, which is why I also use the WordDelimiterFilter (WDF) to split the search query. However, the search query ends up looking like this:

myfield:("twitter com andersoncooper") when I really want it to match all records that contain all of the following separate words: twitter com andersoncooper

Is there a different query filter or tokenizer I should be using?

asked Jan 13 '11 18:01

KidA78

1 Answer

If I understand this statement from your question:

myfield:("twitter com andersoncooper") when I really want it to match all records that contain all of the following separate words: twitter com andersoncooper

then you are trying to write a query that would match both:

http://www.twitter.com/AndersonCooper

and

http://www.andersoncooper.com/socialmedia/twitter

(both links contain all of the tokens), but not match either

http://www.facebook.com/AndersonCooper

or

http://www.twitter.com/AliceCooper
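The intended behaviour for these four URLs can be checked with a small Python sketch. It assumes a simplified tokenizer (split on anything that is not a letter or digit, then lowercase) as a stand-in for the field type's analysis; the helper names are hypothetical and nothing here calls Solr.

```python
import re


def approximate_url_tokens(url):
    """Simplified stand-in for the field analysis (assumption, not Solr code)."""
    return {part.lower() for part in re.findall(r"[A-Za-z]+|[0-9]+", url)}


def matches_all_terms(url, required_terms):
    """True when every required term appears among the URL's tokens, in any position."""
    return set(required_terms) <= approximate_url_tokens(url)


REQUIRED = ["twitter", "com", "andersoncooper"]
URLS = [
    "http://www.twitter.com/AndersonCooper",
    "http://www.andersoncooper.com/socialmedia/twitter",
    "http://www.facebook.com/AndersonCooper",
    "http://www.twitter.com/AliceCooper",
]

results = {url: matches_all_terms(url, REQUIRED) for url in URLS}
for url, matched in results.items():
    print(matched, url)
```

The first two URLs come out as matches and the last two do not, because one lacks "twitter" and the other lacks "andersoncooper".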

If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and querying via curl or some other URL-based mechanism, the query parameter needs to look like this:

&q=myField:andersoncooper AND myField:twitter AND myField:com
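If the terms come from user input, a query of this shape can be assembled programmatically. The following is a minimal sketch; `build_and_query` is a hypothetical helper, and it assumes the terms are plain alphanumeric words that need no escaping for the query parser.

```python
def build_and_query(field, terms):
    """Join one field:term clause per term with explicit AND operators."""
    # Assumption: terms contain no characters that are special to the
    # query parser (spaces, colons, parentheses, quotes, and so on).
    return " AND ".join(f"{field}:{term}" for term in terms)


query = build_and_query("myField", ["andersoncooper", "twitter", "com"])
print(query)  # myField:andersoncooper AND myField:twitter AND myField:com
```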

One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be specified explicitly above. Alternatively, to save some space, you can change the default query operator to "AND" like this:

&q.op=AND&q=myField:(andersoncooper twitter com)
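When the request is sent from code rather than typed by hand, the spaces, colon, and parentheses need URL encoding. Here is a short sketch using Python's standard library; the host, port, and core name in `SELECT_URL` are assumptions for illustration and should be replaced with your own.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: adjust host, port, and core name to your setup.
SELECT_URL = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "myField:(andersoncooper twitter com)",
    "q.op": "AND",  # require every term instead of the default OR
}

query_string = urlencode(params)
request_url = f"{SELECT_URL}?{query_string}"
print(request_url)
```

The resulting URL can be passed to curl or any HTTP client.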

answered Oct 10 '22 04:10

Gus

Tags: url, indexing, solr, tokenize, querying
