
MapReduce using SQL Server as data source

I'm currently investigating the possibility of using MapReduce to maintain incremental view builds in SQL Server.

Basically, use MapReduce to create materialized views.

I'm a bit stuck at the moment thinking about how to partition my map outputs. Now, I don't really have a big-data situation, with roughly 50 GB being the max, but I have a lot of complexity and some implied performance problems. I want to see if this MapReduce/NoSQL approach of mine might pan out.

The thing about MapReduce I'm currently having issues with is the partitioning. Since I'm using SQL Server as the data source, data locality isn't really a problem of mine, so I don't need to send data all over the place; rather, each worker should be able to retrieve its partition of the data based on the map definition.

I intend to fully map the data through LINQ, and maybe something like Entity Framework, just to provide a familiar interface. This is somewhat beside the point, but it's the current route I'm exploring.

Now, how do I split my data? I have a primary key, and I have map and reduce definitions in terms of expression trees (ASTs, if you're unfamiliar with LINQ).

  • Firstly, how do I split the entire input and partition the initial problem? (I'm thinking I should be able to leverage window functions in SQL Server such as ROW_NUMBER and NTILE; see the sketch after this list.)

  • Secondly, and more importantly, how do I make sure this is done incrementally? That is, if I add to or change the original input, how do I minimize the amount of recomputation that needs to take place? (A watermark-based sketch follows the CouchDB note below.)
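On the first point, here's a minimal T-SQL sketch of the NTILE idea, assuming a hypothetical source table `dbo.SourceRows` keyed by `Id`; each worker fetches only its own bucket:

```sql
-- Minimal sketch: split the input into N roughly equal buckets by primary key.
-- Table and column names (dbo.SourceRows, Id) are placeholders.
DECLARE @Workers int = 8;   -- number of map workers
DECLARE @Bucket  int = 3;   -- the bucket this particular worker should fetch

WITH Partitioned AS (
    SELECT s.*,
           NTILE(@Workers) OVER (ORDER BY s.Id) AS BucketNo
    FROM dbo.SourceRows AS s
)
SELECT *
FROM Partitioned
WHERE BucketNo = @Bucket;
```

Note that NTILE has to order the whole input every time the query runs, so if every worker issues this query you pay that cost per worker; in practice it may be cheaper to compute the key ranges once up front and hand each worker a plain `WHERE Id BETWEEN ... AND ...` predicate.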

I've been looking at CouchDB for inspiration and they seem to have a way to do this, but how do I leverage some of that goodness using SQL Server?
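One way to approximate that incremental behaviour on SQL Server is a change watermark. Below is a sketch, assuming the source table carries a `rowversion` column (`RowVer` here) and a hypothetical `dbo.ViewWatermark` bookkeeping table that remembers the last version each materialized view was built against:

```sql
-- Incremental re-mapping sketch. All table and column names are placeholders.
DECLARE @LastVer    binary(8);
DECLARE @CurrentVer binary(8) = @@DBTS;  -- highest rowversion used in this database so far

SELECT @LastVer = LastRowVer
FROM dbo.ViewWatermark
WHERE ViewName = 'MyMaterializedView';   -- seeded with 0x0 on the first run

-- Only rows touched since the last run need to be re-mapped; the map keys
-- they produce tell you which reduce groups have to be recomputed.
SELECT s.*
FROM dbo.SourceRows AS s
WHERE s.RowVer > @LastVer
  AND s.RowVer <= @CurrentVer;

-- After the affected groups have been re-reduced, advance the watermark.
UPDATE dbo.ViewWatermark
SET LastRowVer = @CurrentVer
WHERE ViewName = 'MyMaterializedView';
```

One caveat: `rowversion` won't surface deleted rows, so deletes would need separate handling, for example SQL Server's Change Tracking feature or a delete trigger feeding a tombstone table.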

Asked Oct 26 '11 at 14:10 by John Leidegren

1 Answer

I am facing something similar. I think you should forget windowing functions, since they serialize your process; in other words, all workers will be waiting for the query.

What we have tested, and which is 'working', is to partition the data into multiple tables (every month has its own x tables) and run separate analytical threads on those partitions, marking data as processed/unprocessed/possibly bad/etc. after the Reduce step (sketched below).
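For illustration, a rough sketch of that scheme; all table and column names here are made up:

```sql
-- One table per month, with a status flag marked after Reduce.
CREATE TABLE dbo.SourceRows_2011_10 (
    Id      bigint        NOT NULL PRIMARY KEY,
    Payload nvarchar(max) NULL,
    Status  tinyint       NOT NULL DEFAULT 0
        -- 0 = unprocessed, 1 = processed, 2 = possibly bad
);

-- Each analytical thread works on its own monthly table, so the threads
-- don't contend for locks on one large table.
SELECT Id, Payload
FROM dbo.SourceRows_2011_10
WHERE Status = 0;

-- After a successful Reduce over this partition (simplified; a real run
-- would mark only the specific Ids it actually processed):
UPDATE dbo.SourceRows_2011_10
SET Status = 1
WHERE Status = 0;
```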

Tests with a single partitioned table ran into lock escalation issues.

You'll definitely be adding a bit more complexity to your current solution.

Answered Oct 22 '22 at 14:10 by pavel242