
What's the best approach for using SOLR with web projects?

OK, I'm totally new to Solr and Lucene, but I've got Solr running out of the box under Tomcat 6.x and have just gone over some of the basic wiki entries.

I have a few questions, and require some suggestions too.

  1. Solr can index data in files (XML, CSV) and it can also index DBs. Can you also just point it at a URI/domain and have it index a website the way Google would?

  2. If I have a website with "Pages" data (e.g. "Page Name", "Page Content") and "Products" data (e.g. "Product Name", "SKU"), do I need two different schema.xml files? And if so, does that mean two different instances of Solr?

Finally, if you have a project with a large, normalized relational database, which of the three options below would you say is the best approach?

  1. Have a middleware service running in the background, which mines the DB and builds the relevant XML files to send to Solr.

  2. Have SOLR index the DB directly. In this case, would it be best to just point SOLR to views, which would abstract all the table relationships?

  3. Any other options I'm unaware of?

Context: We're running in a Windows 2003 environment, .NET 3.5, SQLServer 2005/2008

cheers!

andy asked Nov 10 '09 02:11



2 Answers

  1. No. Solr does not crawl websites itself; you need a crawler for that, e.g. Nutch.
  2. Yes, you want two separate indexes (= two schema.xml files), since the datasets don't seem to be related. This doesn't mean two instances of Solr: you can manage the two indexes as Cores within a single instance.
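
To make the Cores point concrete, here is a minimal sketch of how a client might address two cores on one Solr instance. The base URL and the core names `pages` and `products` are assumptions chosen to match the question, not values taken from any real setup; the only thing relied on is that each core is reached under its own path segment.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host, port and context path to your Tomcat setup.
SOLR_BASE = "http://localhost:8080/solr"


def select_url(core, query, rows=10):
    """Build the search URL for one core; each core has its own schema.xml."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"


# Hypothetical core names, one per unrelated dataset.
pages_url = select_url("pages", "page_content:solr")
products_url = select_url("products", "sku:ABC-123")

print(pages_url)
print(products_url)
```

Both URLs point at the same Solr instance; only the core segment differs, so the "Pages" and "Products" schemas stay independent.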

As for populating the Solr index, it depends on your particular project. For example, can it tolerate stale data, or does the index have to be absolutely fresh?

Other options to index data include:

  • Database triggers
  • If you're using some sort of ORM, use its interception capabilities. For example, you can use NHibernate events to update the index on insert, update or delete. If you use NHibernate and SolrNet, this is taken care of automatically.
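
As a rough illustration of the event-driven idea above, the sketch below maps insert/update/delete events to Solr XML update messages. The `on_entity_event` hook and the entity fields are hypothetical; this is not how NHibernate or SolrNet implement it, only the general shape of "react to a change, then emit an add or delete message". Sending the message to Solr's update handler is left out.

```python
from xml.etree import ElementTree as ET


def add_message(doc):
    """Build a Solr XML <add> message for one document (dict of field -> value)."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(doc_id):
    """Build a Solr XML <delete> message for one document id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def on_entity_event(event, entity):
    """Hypothetical hook that an ORM listener or a trigger-fed queue would call."""
    if event in ("insert", "update"):
        # Re-adding a document with the same unique key replaces the old one.
        return add_message(entity)
    if event == "delete":
        return delete_message(entity["id"])
    raise ValueError(f"unknown event: {event}")


update_xml = on_entity_event("update", {"id": 7, "name": "Widget"})
delete_xml = on_entity_event("delete", {"id": 7})
print(update_xml)
print(delete_xml)
```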

Mauricio Scheffer answered Sep 30 '22 16:09



I think Mauricio's advice is dead on. The only point I would add concerns the choice between a "middleware" indexer and indexing the database directly. If your database tables (or views) map very closely to what a good Solr schema wants, then the DataImportHandler (DIH) is great. But if you are indexing from multiple data sources, or if you have to munge the data in your database to fit what Solr would like, then a dedicated middleware indexer is better.
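
The kind of "munging" a middleware indexer typically does can be sketched as follows: rows from a normalized join are collapsed into one flat document per entity, with repeated values gathered into a multi-valued field. The table layout and the field names (`product_id`, `name`, `sku`, `category`) are hypothetical and only echo the question's "Products" example; a real indexer would read the rows from SQL Server and post the resulting documents to Solr.

```python
def flatten_rows(rows):
    """Collapse joined (product, category) rows into one document per product."""
    docs = {}
    for row in rows:
        doc = docs.setdefault(row["product_id"], {
            "id": row["product_id"],
            "name": row["name"],
            "sku": row["sku"],
            "category": [],  # would be a multi-valued field in the Solr schema
        })
        if row["category"] not in doc["category"]:
            doc["category"].append(row["category"])
    return list(docs.values())


# Hypothetical result of joining products to categories: one row per pairing.
rows = [
    {"product_id": 1, "name": "Widget", "sku": "W-1", "category": "Tools"},
    {"product_id": 1, "name": "Widget", "sku": "W-1", "category": "Garden"},
    {"product_id": 2, "name": "Gadget", "sku": "G-2", "category": "Tools"},
]

documents = flatten_rows(rows)
for document in documents:
    print(document)
```

If a database view can already produce one row per document, this step disappears, which is the case where DIH pointed at the view is the simpler choice.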

Eric Pugh answered Sep 30 '22 17:09
