Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Modelling NoSQL database (when converting from SQL database)

Tags:

sql

nosql

ravendb

I have a SQL database that I want to convert to a NoSQL one (currently I'm using RavenDB)

Here are my tables:

Trace:

ID (PK, bigint, not null)
DeploymentID (FK, int, not null)
AppCode (int, not null)

Deployment:

DeploymentID (PK, int, not null)
DeploymentVersion (varchar(10), not null)
DeploymentName (nvarchar(max), not null)

Application:

AppID (PK, int, not null)
AppName (nvarchar(max), not null)

Currently I have these rows in my tables:

Trace:

ID: 1 , DeploymentID: 1, AppCode: 1
ID: 2 , DeploymentID: 1, AppCode: 2 
ID: 3 , DeploymentID: 1, AppCode: 3 
ID: 3 , DeploymentID: 2, AppCode: 1

Deployment:

DeploymentID: 1 , DeploymentVersion: 1.0, DeploymentName: "Test1"
DeploymentID: 2 , DeploymentVersion: 1.0, DeploymentName: "Test2"

Application:

AppID: 1 , AppName: "Test1"
AppID: 2 , AppName: "Test2"
AppID: 3 , AppName: "Test3"

My question is: HOW should I build my NoSQL document model ?

Should it look like:

trace/1
{
 "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
 "Application": "Test1"
}

trace/2
{
 "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
 "Application": "Test2"
}

trace/3
{
 "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
 "Application": "Test3"
}

trace/4    
{
 "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test2" } ],
 "Application": "Test1"
}

And what if Deployment 1 gets changed ? Should I go by each document and change the data?

And when should I use references in NoSQL ?

like image 837
ohadinho Avatar asked Apr 30 '13 08:04

ohadinho


2 Answers

Document databases such as Raven are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.

So in order to answer your question, what we really need to know is how you intend to use the data. For example, displaying all traces ordered by time is a distinctly different scenario than displaying traces associated a specific deployment or application. Each one of those requirements will dictate a different design, as will supporting them both.

This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.

There are a few "do" and "don'ts" when deciding on a strategy:

DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page of web apps is a classic example. First priority is to make sure that these are as efficient as possible. Make sure that your data model allows either A) loading those in either a single IO request or B) is cache-friendly

DONT: don't fall into the dreaded "N+1" trap. This pattern occurs when you data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with #3...

DO: Always cap (via the UX) the amount of data which you are willing to fetch. If the user has 3729 comments you obviously aren't going to fetch them all at once. Even it it was feasible from a database perspective, the user experience would be horrible. Thats why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure to the UX and save the comments in blocks of 20. Then each page refresh involves a single DB get.

DO: Balance the Read and Write requirements. Some types of systems are read-heavy and you can assume that for each write there will be many reads (StackOverflow is a good example). So there it makes sense to make writes more expensive in order to gain benefits in read performance. For example, data denormalization and duplication. Other systems are evenly balanced or even write heavy and require other approaches

DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens all kinds of interesting optimization possibilities in the your data schema.

This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra)

like image 50
Addys Avatar answered Oct 03 '22 06:10

Addys


How you model your documents depends mostly on your application and it's domain. From there, the document model can be refined by understanding your data access patterns.

Blindly attempting to map a relational data model to a non-relational one is probably not a good idea.

UPDATE: I think Matt got the main idea of my point here. What I am trying to say is that there is no prescribed method (that I am aware of anyway) to translate a relational data model (like a normalized SQL Schema) to a non-relational data model (like a document model) without understanding and considering the domain of the application. Let me elaborate a bit here...

After looking at your SQL schema, I have no idea what a trace is besides a table that appears to join Applications and Deployments. I also have no idea how your application typically queries the data. Knowing a little about this makes a difference when you model your documents, just as it would make a difference in the way you model your application objects (or domain objects).

So the document model suggested in your question may or may not work for you application.

like image 42
Alonso Robles Avatar answered Oct 03 '22 07:10

Alonso Robles