NoSQL & AdHoc Queries - Millions of Rows

I currently run a MySQL-powered website where users promote advertisements and earn revenue every time someone completes one. We log every time someone views an ad ("impression"), every time a user clicks an ad ("click"), and every time someone completes an ad ("lead").

Since we get so much traffic, we have millions of records in each of these tables. To show users how much they have earned, we end up running multiple queries against tables with millions and millions of rows in a single request, hundreds of requests concurrently.

We're looking to move away from MySQL to a key-value store or something along those lines. We need something that will let us store all these millions of rows, query them in milliseconds, and MOST IMPORTANTLY, run ad hoc queries against any single column, so we could do things like:

SELECT * FROM leads WHERE country = 'US' AND user_id = 501 (the NoSQL equivalent, obviously)

SELECT * FROM clicks WHERE ad_id = 1952 AND user_id = 200 AND country = 'GB'

etc.

Does anyone have any good suggestions? I was considering MongoDB or CouchDB, but I'm not sure they can handle querying millions of records multiple times a second with the kind of ad hoc queries we need.

Thanks!

asked Jul 04 '11 by Paul B

4 Answers

CouchDB's map/reduce is incremental, which means it processes each document only once and stores the results.

Let's assume, for a moment, that CouchDB is the slowest database in the world. Your first query with millions of rows takes, maybe, 20 hours. That sounds terrible. However, your second query, your third query, your fourth query, and your hundredth query will take 50 milliseconds, perhaps 100 including HTTP and network latency.

You could say CouchDB fails the benchmarks but gets honors in the school of hard knocks.

I would not worry about performance, but rather about whether CouchDB can satisfy your ad hoc query requirements. CouchDB wants to know in advance what queries will occur, so it can do the hard work up front before the query arrives. When the query does arrive, the answer is already prepared and out it goes!

All of your examples are possible with CouchDB. A so-called merge-join (lots of equality conditions) is no problem. However, CouchDB cannot support multiple inequality conditions simultaneously: you cannot ask CouchDB, in a single query, for users between ages 18 and 40 who also clicked fewer than 10 times.
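To give a feel for what that looks like, here is a minimal sketch (the "leads" database, the "stats" design document, and the document fields are my assumptions, not anything from the question). The map function is plain JavaScript stored as a string inside a design document; CouchDB runs it once per document and incrementally maintains the emitted rows:

    // Sketch only: assumes CouchDB on localhost:5984 with a "leads" database,
    // and Node 18+ for the global fetch().
    const designDoc = {
      _id: "_design/stats",
      views: {
        by_user_country: {
          // composite key = the equality columns of the ad hoc query
          map: `function (doc) {
            if (doc.type === "lead") {
              emit([doc.user_id, doc.country], 1);
            }
          }`,
          reduce: "_count",
        },
      },
    };

    await fetch("http://localhost:5984/leads/_design/stats", {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(designDoc),
    });

    // "FROM leads WHERE user_id = 501 AND country = 'US'" becomes a keyed view read:
    const key = encodeURIComponent(JSON.stringify([501, "US"]));
    const res = await fetch(
      `http://localhost:5984/leads/_design/stats/_view/by_user_country?key=${key}`
    );
    console.log(await res.json()); // reduced row carrying the lead count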

The nice thing about CouchDB's HTTP and JavaScript interface is that it's easy to do a quick feasibility study. I suggest you try it out!

answered by JasonSmith


Most people would probably recommend MongoDB for a tracking/analytics system like this, for good reasons. You should read the "MongoDB for Real-Time Analytics" chapter from the "MongoDB Definitive Guide" book. Depending on the size of your data and your scaling needs, you could get all the performance, schema-free storage, and ad hoc querying features. You will need to decide for yourself whether the system's durability and unpredictability issues are an acceptable risk for you.
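For a sense of what that buys you, this is roughly the question's second example with the official Node.js driver (a minimal sketch; the database, collection, and field names are my assumptions):

    // npm i mongodb — sketch assumes a local mongod on the default port
    import { MongoClient } from "mongodb";

    const client = new MongoClient("mongodb://localhost:27017");
    await client.connect();
    const clicks = client.db("tracking").collection("clicks");

    // a compound index turns the ad hoc equality query into an index scan
    await clicks.createIndex({ ad_id: 1, user_id: 1, country: 1 });

    // FROM clicks WHERE ad_id = 1952 AND user_id = 200 AND country = 'GB'
    const docs = await clicks
      .find({ ad_id: 1952, user_id: 200, country: "GB" })
      .toArray();
    console.log(docs.length);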

For a simpler tracking system, Redis would be a very good choice, offering rich functionality, blazing speed, and real durability. To get a feel for how such a system could be implemented in Redis, see this gist. The downside is that you'd need to define all the "indices" yourself, rather than getting them for "free" as with MongoDB. Nevertheless, there's no free lunch, and MongoDB indices are definitely not a free lunch either.
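In the spirit of that gist (this is my own minimal sketch, not its code; all key names are made up), every "index" is just a counter you maintain yourself at write time:

    // npm i redis — sketch assumes a local Redis server
    import { createClient } from "redis";

    const redis = createClient();
    await redis.connect();

    // record one click, bumping one counter per "index" we want to query later
    async function trackClick(adId: number, userId: number, country: string) {
      await Promise.all([
        redis.incr(`clicks:ad:${adId}`),
        redis.incr(`clicks:user:${userId}`),
        redis.incr(`clicks:user:${userId}:country:${country}`),
      ]);
    }

    await trackClick(1952, 200, "GB");

    // "FROM clicks WHERE user_id = 200 AND country = 'GB'" is a single O(1) read:
    console.log(await redis.get("clicks:user:200:country:GB"));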

I think you should also have a look at what ElasticSearch would give you (see the sketch after this list):

  • Blazing speed
  • Schema-free storage
  • Sharding and distributed architecture
  • Powerful analytic primitives in the form of facets
  • Easy implementation of "sliding window"-type data storage with index aliases
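As a minimal sketch of those facets (the index name and fields are my assumptions, and the query goes straight to the HTTP API rather than through a client library):

    // sketch assumes ElasticSearch on localhost:9200 with raw click documents
    // indexed in "clicks"; facets were the analytics primitive of this era
    const body = {
      query: {
        bool: {
          must: [
            { term: { user_id: 200 } },
            { term: { country: "GB" } },
          ],
        },
      },
      // counts the user's GB clicks per ad in one request
      facets: {
        clicks_per_ad: { terms: { field: "ad_id" } },
      },
    };

    const res = await fetch("http://localhost:9200/clicks/_search", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    });
    console.log(await res.json());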

It is at heart a "fulltext search engine", but don't let that confuse you. Read the "Data Visualization with ElasticSearch and Protovis" article for a real-world use case of ElasticSearch as a data mining engine.

Have a look at these slides for a real-world use case of the "sliding window" scenario.

There are many client libraries available for ElasticSearch, such as Tire for Ruby, so it's easy to get a prototype off the ground quickly.

For the record (with all due respect to @jhs :), based on my experience, I cannot imagine an implementation where CouchDB is a feasible and useful option. It would be an awesome backup storage for your data, though.

answered by karmi


If your working set can fit in memory and you index the right fields in the documents, you'd be all set. What you're asking for is not very typical, but I am sure that with proper hardware, the right collection design (denormalize!), and indexing, you should be good to go. Read up on Mongo querying, and use explain() to test your queries. My suggestion would be to stay away from $in and $nin (IN / NOT IN) clauses.
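As a minimal sketch of that explain() check (same assumed names as the earlier Mongo snippet):

    import { MongoClient } from "mongodb";

    const client = new MongoClient("mongodb://localhost:27017");
    await client.connect();
    const clicks = client.db("tracking").collection("clicks");

    // ask the server how it would execute the ad hoc query
    const plan = await clicks
      .find({ ad_id: 1952, user_id: 200, country: "GB" })
      .explain();

    // you want an index scan here, not a full collection scan
    console.log(JSON.stringify(plan, null, 2));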

answered by lobster1234


With those requirements, you are probably better off sticking with SQL and setting up replication/clustering if you are running into load issues. You can set up indexing on a document database so that those queries are possible, but you don't really gain anything over your current system.

NoSQL systems generally improve performance by leaving out some of the more complex features of relational systems. This means that they will only help if your scenario doesn't require those features. Running ad hoc queries on tabular data is exactly what SQL was designed for.
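To make that concrete, here is a minimal sketch of staying on MySQL (using the mysql2 client; the index and column names are my assumptions). A composite index over the equality columns is usually all these queries need:

    // npm i mysql2 — sketch assumes the existing MySQL schema from the question
    import mysql from "mysql2/promise";

    const db = await mysql.createConnection({
      host: "localhost",
      user: "root",
      database: "tracking",
    });

    // composite index covering the ad hoc equality columns
    await db.query(
      "CREATE INDEX idx_leads_user_country ON leads (user_id, country)"
    );

    // the original query, now an index range read instead of a table scan
    const [rows] = await db.execute(
      "SELECT COUNT(*) AS leads FROM leads WHERE user_id = ? AND country = ?",
      [501, "US"]
    );
    console.log(rows);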

answered by Tom Clarkson