Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rails expanding fields with scope, PG does not like it

I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:

class Widget < ActiveRecord::Base
  def self.in_company(company)
    includes(:store => {:area => :company}).where(:companies => {:id => company.id})
  end
end

Which will generate this beautiful query:

> Widget.in_company(Company.first).count

SQL (50.5ms)  SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
 => 15088 

But, I later need to use this scope in more complex scope. The problem is that AR is expanding the query by selecting individual fields, which fails in PG because selected fields must in the GROUP BY clause or the aggregate function.

Here is the more complex scope.

def self.sum_amount_chart_series(company, start_time)
  orders_by_day = Widget.in_company(company).archived.not_void.
                  where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
                  group(pg_print_date_group).
                  select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")

end

def self.pg_print_date_group
  "CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end

And this is the select it is throwing at PG:

> Widget.sum_amount_chart_series(Company.first, 1.day.ago)

SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)

Which generates this error:

PGError: ERROR: column "widgets.id" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT "widgets"."id" AS t0_r0, "widgets"."user_id...

How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?

like image 969
Karl Avatar asked Apr 25 '11 18:04

Karl


3 Answers

As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.

Suppose you've a query like:

select a, b, agg(c)
from tbl
group by a

PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.

If memory serves, implementation details are so that MySQL will actually sort by a, b and return the first b in the set. But as far as the SQL standard is concerned, the behavior is unspecified -- and sure enough, PostgreSQL does not always sort before running aggregate functions.

Potentially, this might result in different values of b in result set in PostgreSQL. And thus, PostgreSQL yields an error unless you're more specific:

select a, b, agg(c)
from tbl
group by a, b

What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, than you can leave b unspecified -- the planner has been taught to ignore subsequent group by fields when applicable primary keys imply a unique row.

For your problem in particular, you need to specify your group by as you currently do, plus every field that you're basing your aggregate onto, i.e. "widgets"."id", "widgets"."user_id", [snip] but not stuff like sum(amount), which are the aggregate function calls.

As an off topic side note, I'm not sure how your ORM/model works but the SQL it's generating isn't optimal. Many of those left outer joins seem like they should be inner joins. This will result in allowing the planner to pick an appropriate join order where applicable.

like image 197
Denis de Bernardy Avatar answered Sep 24 '22 17:09

Denis de Bernardy


PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.

From the release notes:

Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause (Peter Eisentraut)

Some other database system already allowed this behavior, and because of the primary key, the result is unambiguous.

You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.

like image 39
Frank Heikens Avatar answered Sep 24 '22 17:09

Frank Heikens


Firstly simplify your life by storing all dates in a standard time-zone. Changing dates with time-zones should really be done in the view as a user convenience. This alone should save you a lot of pain.

If you're already in production write a migration to create a normalised_date column wherever it would be helpful.

nrI propose that the other problem here is the use of raw SQL, which rails won't poke around for you. To avoid this try using the gem called Squeel (aka Metawhere 2) http://metautonomo.us/projects/squeel/

If you use this you should be able to remove hard coded SQL and let rails get back to doing its magic.

For example:

.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")

becomes (once your remove the need for normalising the date):

.select{sum(amount).as(total_amount)}
like image 2
thomasfedb Avatar answered Sep 23 '22 17:09

thomasfedb