I'd love to know how Stack Overflow's tagging and search is architected, because it seems to work pretty well.
What is a good database/search model if I want to do all of the following:
- Storing Tags on various entities, (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
- Searching for items with particular tags
- Building a tag cloud of all tags that apply to a particular search result set
- How to show a tag list for each item in a search result?
Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?
I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.
Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.
So here I go again...
Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.
Regarding the OS projects recommended by @Grant:
- *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
- Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
- Both projects have a 3-table approach to tagging as everyone generally seems to recommend
I also found some other questions on SO that I'd missed before:
- How Do You Recommend Implementing Tags or Tagging?
- How to structure data for searchability?
- Database Design for Tagging
What I'm currently doing for each of the items I mentioned:
- In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
- Build site-wide tag clouds
- browse by tag (i.e. urls like SO's /questions/tagged/ASP.NET)
- For search I use Lucene + NHibernate.Search
- Tags are concat'd into a TagString that is indexed by Lucene
- So I have the full power of the Lucene query engine (AND / OR / NOT queries)
- I can search for text and filter by tags at the same time
- The Lucene analyzer merges words for better tag searches (i.e. a tag search for "test" will also find stuff tagged "testing")
- Lucene returns a potentially enormous result set, which I paginate to 20 results
- Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
- So it's entirely possible that a search results in 0 hits to the DB
- Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
- Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.
This means that whenever an Entity's tags are modified, I have to:
- Insert any new Tags that do not already exist
- Insert/Delete from the EntityTag table
- Update Entity.TagString
- Update the Lucene index for the Entity
Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.
Time will tell...