What's the best approach for using SOLR with web projects?

Question

OK, I'm totally new to Solr and Lucene, but I have Solr running out of the box under Tomcat 6.x and have gone over some of the basic wiki entries.

I have a few questions, and require some suggestions too.

1. Solr can index data in files (XML, CSV) and it can also index DBs. Can you also just point it at a URI/domain and have it index a website the way Google would?
2. If I have a website with "Pages" data ("Page Name", "Page Content", etc.) and "Products" data ("Product Name", "SKU", etc.), do I need two different schema.xml files? And if so, does that mean two different instances of Solr?

Finally, if you have a project with a large, normalized relational database, which of the options below would you say is the best approach?

Have a middleware service running in the background, which mines the DB and manually creates the relevant XML files to then send to SOLR
Have SOLR index the DB directly. In this case, would it be best to just point SOLR to views, which would abstract all the table relationships?
Any other options I'm unaware of?

Context: We're running in a Windows 2003 environment, .NET 3.5, SQLServer 2005/2008

cheers!

Mauricio Scheffer · Accepted Answer

1. No, you need a crawler for that, e.g. Nutch.
2. Yes, you want two separate indexes (= two schema.xml files) since the datasets don't seem to be related. This doesn't mean two instances of Solr; you can manage the two indexes with Cores.
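
To make the Cores point concrete, here is a minimal sketch of how a client might address two cores on a single Solr instance. The base URL, the core names `pages` and `products`, the field name `page_name`, and the helper `core_url` are all assumptions for illustration; the real paths depend on how the cores are named in your configuration and where Tomcat mounts the Solr webapp.

```python
from urllib.parse import urlencode

# Assumed base URL of one Solr webapp under Tomcat (hypothetical).
SOLR_BASE = "http://localhost:8080/solr"

def core_url(core, handler, params=None):
    """Build the URL for a request handler on a named core."""
    url = f"{SOLR_BASE}/{core}/{handler}"
    if params:
        url += "?" + urlencode(params)
    return url

# One Solr instance, two independent indexes (core names are assumptions).
pages_search = core_url("pages", "select", {"q": "page_name:contact"})
products_update = core_url("products", "update")
```

Each core has its own schema.xml, so a query against `pages` never sees product documents.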

As for populating the Solr index, it depends on your particular project. For example, can it tolerate stale data, or does the index have to be absolutely fresh?

Other options to index data include:

Database triggers
If you're using some sort of ORM, use its interception capabilities. For example, you can use NHibernate events to update the index on update, insert or delete. If you use NHibernate and SolrNet, this is taken care of automatically.
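
Both options above come down to the same idea: every change in the database is turned into an update message for Solr. A rough sketch of that mapping follows. The hook `on_entity_event`, its event names, and the field names are hypothetical; this is not the NHibernate or SolrNet API, only an illustration of the Solr XML update messages (`<add>` and `<delete>`) such a hook would produce. Sending the message to the core's update handler is left out.

```python
import xml.etree.ElementTree as ET

def add_message(doc):
    """Build a Solr <add> message for one document (dict of field -> value)."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def delete_message(doc_id):
    """Build a Solr <delete> message that removes one document by id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")

def on_entity_event(event, entity):
    """Hypothetical hook: map a persistence event to a Solr update message."""
    if event in ("insert", "update"):
        # Solr replaces any existing document with the same unique key.
        return add_message(entity)
    if event == "delete":
        return delete_message(entity["id"])
    raise ValueError(f"unknown event: {event}")
```

For example, `on_entity_event("delete", {"id": 7})` returns `<delete><id>7</id></delete>`. Changes only become searchable after a commit, so how often you commit is where the stale-versus-fresh trade-off shows up.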

Eric Pugh · Answer

I think Mauricio is dead on with his advice. The only point I would add is about deciding between a "middleware" indexer and indexing the database directly. If your database (or the views?) maps very closely to what a good Solr schema wants, then DIH (the DataImportHandler) is great. But if you are indexing from multiple sources of data, or if you have to munge the data in your database to fit what Solr would like, then a dedicated middleware indexer is better.
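
As an illustration of the kind of "munging" a middleware indexer does, the sketch below flattens joined rows from a normalized database into one document per product, with the categories collected into a multi-valued field. The row shape `(sku, name, category)`, the field names, and the function `flatten_rows` are assumptions made up for this example; a real indexer would read the rows from SQL Server and then post the resulting documents to Solr.

```python
def flatten_rows(rows):
    """Group joined (sku, name, category) rows into one document per SKU."""
    docs = {}
    for sku, name, category in rows:
        doc = docs.setdefault(sku, {"sku": sku, "name": name, "category": []})
        # A LEFT JOIN yields None for products without a category.
        if category is not None and category not in doc["category"]:
            doc["category"].append(category)
    return list(docs.values())

# Sample rows as a product/category join might return them (made-up data).
rows = [
    ("A-1", "Widget", "Tools"),
    ("A-1", "Widget", "Garden"),
    ("B-2", "Gadget", None),
]
docs = flatten_rows(rows)
```

If a database view can already deliver one row per document in this shape, DIH can index it directly and the middleware step is unnecessary.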

Tags:

indexing

search

solr

andy
