Case studies or examples of high-throughput services with highly dynamic data

I'm looking for some architecture ideas on a problem at work that I may have to solve.

The problem:
1) Our enterprise LDAP has become a "contact master" filled with years of stale data and unused, unmaintained attributes.
2) Management has decided that LDAP will no longer serve as the company phone book; it is for authorization purposes only.
3) The company has contact-type data about people in hundreds of different sources. We need to scrub all the junk out of LDAP and give the other applications a central repository to store all this data about a person.

The ideal goal:
1) Have a single source to store all the various attributes about a person.
2) The company probably has info on 500K people (read: 500K rows).
3) I estimate there could be 500 to 1,000 optional attributes on these people (read: 500+ columns).
4) Data would primarily be set/get via XML over JMS (this infrastructure is already in place).
5) Individual groups within the company could "own" columns. Only they would be allowed to write to their columns, and they would be responsible for keeping the data clean.
6) A single-record lookup should return in under a second.
7) The system should support 1 million requests per hour at peak.
8) The primary goal is to serve real-time data to the enterprise; reporting is a secondary goal.
9) We are a Java, Oracle, and Teradata shop: your typical big IT shop.

My thoughts:
1) Originally I thought LDAP might work, but it doesn't scale when new columns are added.
2) My next thought was some kind of NoSQL solution, but from what I have read I don't think I can get the performance I need, and it's still relatively new. I'm not sure I can get my manager to sign off on something like that for such a critical project.
3) I think there will be a metadata component to the solution that tracks who owns each column, what each column represents, and the original source system; something along the lines of the sketch below.
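
For example (purely illustrative; the table and column names are placeholders I'd expect to change):

    -- One row of metadata per owned attribute.
    CREATE TABLE attribute_registry (
        attribute_name  VARCHAR2(64)   PRIMARY KEY,
        owner_group     VARCHAR2(64)   NOT NULL,   -- who may write it
        description     VARCHAR2(512),             -- what it represents
        source_system   VARCHAR2(64)               -- where it originated
    );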

Thanks for reading, and thanks in advance for any thoughts.

asked Aug 11 '10 by clarson

2 Answers

SQL

With Teradata-grade tools, an SQL-based solution may be feasible. I came across an article on database design a while ago that discussed "anchor modeling".

Basically, the idea is to create a single, dumb, synthetic primary-key table, while all real data and metadata live in other tables (subsets) and are attached by way of a foreign key + join.
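
A minimal sketch of that shape, with hypothetical table and column names (the anchor holds nothing but identity; a subset only has a row when a person actually has data there):

    -- Anchor: one row per person, synthetic key only.
    CREATE TABLE person_anchor (
        person_id  NUMBER(12) PRIMARY KEY
    );

    -- One subset per cluster of related attributes, keyed back
    -- to the anchor. Rows exist only where data exists.
    CREATE TABLE contact_subset (
        person_id   NUMBER(12) PRIMARY KEY
                    REFERENCES person_anchor (person_id),
        email       VARCHAR2(254),
        desk_phone  VARCHAR2(32),
        updated_at  TIMESTAMP DEFAULT SYSTIMESTAMP
    );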

I see the benefit of this design as twofold. First, you can more easily compartmentalize data storage, whether for organizational or performance reasons. Second, you only create additional rows for records that have data in a given subset, so you use less space, and indexing and searching are faster.

Subsets might be based on maintainer or some other criteria. XML set/get would be per subset/record (rather than per global record). All subsets for a given record can be composited and cached. Additional subsets can be created for metadata, search indexes, etc., and these can be queried independently.
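
Continuing the sketch, compositing a record is just an outer join across whichever subsets a caller wants, and your per-group write ownership falls out of ordinary table grants. hr_subset and the role names here are hypothetical:

    -- Composite one record across two subsets.
    SELECT a.person_id, c.email, c.desk_phone, h.department
      FROM person_anchor a
      LEFT JOIN contact_subset c ON c.person_id = a.person_id
      LEFT JOIN hr_subset      h ON h.person_id = a.person_id
     WHERE a.person_id = :id;

    -- Each owning group may write only to its own subset table.
    GRANT SELECT ON contact_subset TO directory_readers;
    GRANT INSERT, UPDATE, DELETE ON contact_subset TO contact_team;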

NoSQL

NoSQL seems similar to LDAP (in theory, at least), but the benefits of a good NoSQL tool would include greater abstraction of metadata, versioning, and organization. In fact, from what I've read, it seems that NoSQL datastores are designed to address some of the issues you've raised with respect to scaling and loosely structured data. There's a good question on SO regarding datastores.

Production NoSQL

Off-hand, there are a handful of large companies using NoSQL-style datastores in massively scaled environments; Google's Bigtable is a prime example. It seems like the perfect tool for:

6) A single-record lookup should return in under a second.
7) The system should support 1 million requests per hour at peak.

Bigtable is only available (to my knowledge) through AppEngine. Other similar technologies are listed here.

Other Thoughts

The bigger-picture view looks more or less the same regardless of the technology you decide to use: compartmentalize storage, composite the views, cache the views, and stick the metadata somewhere so you can find things.
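
In Oracle terms, for instance, the "composite and cache" step could start life as a materialized view over the subset joins. This is a sketch under the hypothetical schema above, not a tuning recommendation; the refresh strategy would depend on how fresh the data needs to be:

    -- Pre-composited read model, rebuilt on demand.
    CREATE MATERIALIZED VIEW person_composite_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
    AS
    SELECT a.person_id, c.email, c.desk_phone, h.department
      FROM person_anchor a
      LEFT JOIN contact_subset c ON c.person_id = a.person_id
      LEFT JOIN hr_subset      h ON h.person_id = a.person_id;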

The performance characteristics you're targeting are going to require some kind of caching and/or optimization based on real-world usage patterns. Regardless of the solution you choose, you probably can't resolve that in the design phase.

answered by cbednarski


A couple of thoughts:

1) Our enterprise LDAP has become a "contact master" filled with years of stale data and unused, unmaintained attributes.

This isn't really a technological problem. You will have this problem with a new system as well, LDAP or not.

"LDAP ... doesn't scale"

There are lots of huge LDAP systems out there. LDAP is surely a dark art, but I'd be willing to bet that it scales better than any SQL equivalent in this situation. Not to mention that LDAP is a standard for this kind of info, and as such it is accessible from zillions of different kinds of systems.

Maybe what you're looking for is a new LDAP system that's easier to manage and has better admin tools?

answered by Seth