Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ActiveRecord - Denormalization Case Study

What's the best way to deal with the 8 different SQL questions below.

I have placed below a database schema, how it is represented in my Rails models, and seven questions for data I need to get out of my database. Some questions I have answered, others I'm not sure of the best solution.

Question #7 is a curve ball, because it potentially changes the answers to all the other questions.

Criteria

  1. Shouldn't require n+1 queries. Multiple queries are okay, but if every row returned requires an additional query, it's not scalable.
  2. Shouldn't require post-processing to filter results that SQL can do on its own. For example, the answer to number five shouldn't be to pull ALL students from the data store, then remove those with no Courses.
  3. Retrieving a count on an object shouldn't trigger another SQL query.
  4. Shouldn't have to add a database column via denormalization if SQL allows me to aggregate the data
  5. Would a NOSQL solution, such as MongoDB or CouchDB, be better suited to answer all the questions below?

Database Schema

Students
-------
ID
Name

Courses
-----
ID
Name
Grade

Enrollments
----------
ID
Student_ID
Course_ID

ActiveRecord Models


class Course < ActiveRecord::Base
  has_many :enrollments
  has_many :students, :through=>:enrollments
end
class Enrollment < ActiveRecord::Base
  belongs_to :student
  belongs_to :course
end
class Student < ActiveRecord::Base
  has_many :enrollments
  has_many :courses, :through => :enrollments
end

Questions

1) Retrieve all students in the 9th Grade Math Course

SQL


SELECT s.* FROM Students s
LEFT JOIN Enrollments e on e.student_id = s.id
LEFT JOIN Courses c on e.course_id = c.id
WHERE c.grade = 9 AND c.name = 'Math'

Solution

This one is simple. ActiveRecord handles this well


c = Course.where(:grade=>9).where(:name=>'Math').first
c.students

2) Retrieve all Courses taken by John

SQL


SELECT c.* FROM Courses c
LEFT JOIN Enrollments e on c.id = e.course_id
LEFT JOIN Students s on e.student_id = s.id
WHERE s.name = 'John'

Solution

Again, simple.


s = Student.where(:name=>'John').first
s.courses

3) Retrieve all 9th grade Courses along with the number of students taking the course (but don't retrieve the students)

SQL


SELECT c.*, count(e.student_id) FROM Courses C
LEFT JOIN Enrollments e on c.id = e.course_id
WHERE c.grade = 9 GROUP BY c.id

Solution

Counter Cache will work nicely here.

class AddCounters < ActiveRecord::Migration
  def up
    add_column :students, :courses_count, :integer, :default=>0
    add_column :courses, :students_count, :integer, :default=>0
    Student.reset_column_information
    Student.all.each do |s|
      Student.update_counters s.id, :courses_count => s.courses.length
    end
    Course.reset_column_information
    Course.all.each do |c|
      Course.update_counters c.id, :students_count => c.students.length
    end
  end

  def down
    remove_column :students, :courses_count
    remove_column :courses, :students_count
  end
end

ActiveRecord

Course.where(:grade=>9).each do |c|
  puts "#{c.name} - #{c.students.size}"
end

4) Retrieve all students taking at least three 11th Grade Courses, more than one 10th Grade Courses, and no 9th grade courses

NO Solution

Not sure of the best solution. This would be VERY messy to do in SQL without keeping a counter cache for number of courses per grade level on each student. I could add a hook to update this information myself. I don't want to pull all students and courses and count them in post processing.

Slow Solution

The following solution produces a lot of queries. Preloading the courses may not be possible. (For example, the students are coming from the association on a course)


students = some_course.students
matching_students = []
students.each do |s|
  courses_9 = 0
  courses_10 = 0
  courses_11 = 0
  s.courses.each do |c|
    courses_9  += 1 if c.grade == 9
    courses_10 += 1 if c.grade == 10
    courses_11 += 1 if c.grade == 11
  end
  if courses_11 <= 3 && courses_10 > 1 && courses_9 == 0
    matching_students << s
  end
end
return matching_students

5) Retrieve all students who are taking more than one math course query)

SQL


SELECT s.*, count(e.course_id) as num_Courses FROM Students s
INNER JOIN Enrollments e on s.id = e.student_id
INNER JOIN Courses c on e.course_id = c.id AND c.name = 'Math'
GROUP BY s.id HAVING num_Courses > 0

Or


SELECT DISTINCT s.* FROM Students s
INNER JOIN Enrollments e_math_1 on e_math_1.student_id = s.id
INNER JOIN Courses c_math_1 ON e_math_1.course_id = c_math_1.id AND c_math_1.name = 'Math'
INNER JOIN Enrollments e_math_2 on e_math_2.student_id = s.id
INNER JOIN Courses c_math_2 ON e_math_2.course_id = c_math_2.id AND c_math_2.name = 'Math'
WHERE c_math_1.id != c_math_2.id

NO Solution

Not sure of the best solution. The tricky part to this is that the ActiveRecord (or NoSQL) solution can't retrieve all students, and looking at their courses afterwards, because that would be too slow.

Slow Solution


students = SomeObject.students
multiple_math_course_students = []
students.each do |s|
  has_math_course = false
  add_student = false
  s.courses.each do |c|
    if c.name == 'Math'
      if has_math_course
        add_student = true
      else
        has_math_course = true
      end
    end
  end
  multiple_math_course_students << s if add_student
end

6) Retrieve all students who are taking a Math And Science course

SQL


SELECT s.* FROM Students s
INNER JOIN Enrollments e_math on e_math.student_id = s.id
INNER JOIN Courses c_math ON e_math.course_id = c_math.id
INNER JOIN Enrollments e_science on e_science.student_id = s.id
INNER JOIN Courses c_science on e_science.course_id = c_science.id WHERE c_math.name = 'Math' AND c_science.name = 'Science'

NO Solution

This involves joining to the same table (or in Rails, association) twice. Is there a way to do this smoothly with ActiveRecord's AREL wrapper? You could make a separate association for science classes and math classes, allowing you to do separate operations on each, but this won't work in the case of #7 below.

Slow Solution


students = SomeObject.students
math_and_science_students = []
students.each do |s|
  has_math_course = false
  has_science_course = false
  s.courses.each do |c|
    has_math_course = true if c.name == 'Math'
    has_science_course = true if c.name == 'Science'
  end
  math_and_science_students << s if has_math_course && has_science_course
end

7) The customer has stated that anytime a student is shown in the system, display a number next to the student that shows the highest grade level course they are taking. For example, if Suzie is taking a 9th grade science course and a 10th grade math course, display a '10' next to Suzie.

Solution

It would not be acceptable to query the database for every student record. A page which displays 100 students would require 100 queries. At this point, I want to denormalize the database by putting a flag in the student table with "highest level course". Is this my best course of action? Would it be better to use a different data store other than a relational database from the start?

Imagine that the customer asked for any arbitrary data to be shown as a badge: Highest Grade Level, Number of Math Courses Taken, Gold Badge if taking Math, Science and History all together, etc. Should each of these cases be a call for denormalization of the database? Should denormalized data be kept in the same relational database as normalized data?

like image 391
Dave Avatar asked Sep 04 '12 21:09

Dave


People also ask

What is denormalization with example?

The goal of denormalization is to move data from normalized tables back into a single table to have the data where it is needed. For example, if a query joins multiple tables to get the data but indexing is not sufficient, denormalizing may be better.

Which type of database is suitable for denormalization?

Denormalization has a place with SQL and NoSQL databases, as well as in data warehousing.

Does denormalization reduce data redundancy?

Redundant data is eliminated when normalization is performed whereas denormalization increases the redundant data. Normalization increases the number of tables and joins.

When should you Denormalize database design?

Only if you need your database to perform better at particular tasks (such as reporting) should you opt for denormalization. If you do denormalize, be careful and make sure to document all changes you make to the database.


2 Answers

First, I think your database schema is fine. I would NOT de-normalize based upon these use cases, as they are very common.

Second, you have to learn to distinguish between Persistence, business logic and reports. ActiveRecord is good for basic persistence and encapsulating business logic. It handles the CRUD stuff and lets you put a lot of the logic of your application in the model. However, a lot of the logic you are talking about sounds like reports, especially #6. You are going to have to accept that for some kind of querying logic like this, raw SQL is going to be your best bet. I think the cache counters you have implemented might help you stay in active record and models if you are more comfortable there, but most likely you will have to drop to plain sql as you have done for several of these solutions. Reports in general require straight sql.

A normalized database is crucial to good application design. Its is really important for making your code clean for OLTP transaction and business logic. Don't denormalize just because you have to do some joins in sql. That is what sql is good at. All you are going to do by denormalizing is making some of your reporting logic faster and easier at the expensive of making your persistence and OLTP logic slower and harder.

So i would start out keeping your normalized database. If you need to join on a related table you can often use activerecord's include method to do this without resorting to regular sql. To do things like counts based on joins you'll have to use plain sql.

Eventually, if your database gets very large with lots of data, your reports will be slow because of all the joins you'll have to do. This is FINE. AT that point and no sooner, start considering making a separate reporting database that is denormalized that you can update hourly, nightly, weekly etc from the normalized database. Then move your reporting logic to query the reporting database without having to do joins. There is no need to start off this way however. You're just incurring extra complexity and expense without being certain of the payoff. Maybe your reporting sql with joins will work indefinitely without denormalization with the use of indexes. Don't prematurely optimize.

I don't think nosql is necessarily the answer either. From what little I know, NoSQL works well for specific uses cases. your application's uses cases and schema seem suited fine to relational databases.

Overall, I think the combination of raw sql (not arel/activerecord) and counters you have implemented are fine.

like image 180
fregas Avatar answered Oct 18 '22 18:10

fregas


I'm running into the same issue at the moment. From my research there are a few ways to get around it.

First of all, I believe any application will run into these issues. The basic idea is that we model our data in a normalized fashion which inherently becomes slow and cumbersome when there's a lot of data and the data span across multiple tables.

The best approach I've been able to come up with is the following:

  1. Model the problem as close to the real world thing you're working on
  2. Normalize as needed

These two should give a lot of flexibility for the application and provide many convenience methods as well as solve most of the questions I'm trying to answer

Once I need to do a bunch of joins to get what I need and I feel like I should denormalize the tables to easily get to what I need, I consider the following:

SQL views: These are pre-defined sql statements, joins for example, to which I can link models to. Generally this is way faster than querying via ActiveRecord http://hashrocket.com/blog/posts/sql-views-and-activerecord

aggregate table: Create one or more aggregate tables and update asynchronously using delayed_job, resque for example. These aggregate can get updated once a day for example and the model can query it directly. Note that this is some sort of denormalized table.

Couchbase (NOSQL) I haven't used this one but it looks very interesting. http://couchbaseonrails.com/understand

like image 1
lsaffie Avatar answered Oct 18 '22 19:10

lsaffie