Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

I'd love to know how Stack Overflow's tagging and search is architected, because it seems to work pretty well.

What is a good database/search model if I want to do all of the following:

  1. Storing Tags on various entities, (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
    • Searching for items with particular tags
    • Building a tag cloud of all tags that apply to a particular search result set
    • How to show a tag list for each item in a search result?

Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?

I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.

like image 642
Winston Fassett Avatar asked Oct 24 '08 18:10

Winston Fassett


1 Answers

Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

So here I go again...

Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.

Regarding the OS projects recommended by @Grant:

  • *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
  • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
  • Both projects have a 3-table approach to tagging as everyone generally seems to recommend

I also found some other questions on SO that I'd missed before:

  • How Do You Recommend Implementing Tags or Tagging?
  • How to structure data for searchability?
  • Database Design for Tagging

What I'm currently doing for each of the items I mentioned:

  1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
    • Build site-wide tag clouds
    • browse by tag (i.e. urls like SO's /questions/tagged/ASP.NET)
  2. For search I use Lucene + NHibernate.Search
    • Tags are concat'd into a TagString that is indexed by Lucene
      • So I have the full power of the Lucene query engine (AND / OR / NOT queries)
      • I can search for text and filter by tags at the same time
      • The Lucene analyzer merges words for better tag searches (i.e. a tag search for "test" will also find stuff tagged "testing")
    • Lucene returns a potentially enormous result set, which I paginate to 20 results
    • Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
    • So it's entirely possible that a search results in 0 hits to the DB
  3. Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
  4. Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.

This means that whenever an Entity's tags are modified, I have to:

  • Insert any new Tags that do not already exist
  • Insert/Delete from the EntityTag table
  • Update Entity.TagString
  • Update the Lucene index for the Entity

Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.

Time will tell...

like image 108
Winston Fassett Avatar answered Sep 23 '22 03:09

Winston Fassett