Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rails: Performance issue with joining of records

I have the following setup with ActiveRecord and MySQL:

  1. User has many groups through memberships
  2. Group has many users through memberships

There is also an index by group_id and user_id described in schema.rb:

add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree

3 different queries:

User.where(id: Membership.uniq.pluck(:user_id))

(3.8ms) SELECT DISTINCT memberships.user_id FROM memberships User Load (11.0ms) SELECT users.* FROM users WHERE users.id IN (1, 2...)

User.where(id: Membership.uniq.select(:user_id))

User Load (15.2ms) SELECT users.* FROM users WHERE users.id IN (SELECT DISTINCT memberships.user_id FROM memberships)

User.uniq.joins(:memberships)

User Load (135.1ms) SELECT DISTINCT users.* FROM users INNER JOIN memberships ON memberships.user_id = users.id

What is the best approach for doing this? Why the query with join is much slower?

like image 430
user3409950 Avatar asked Oct 14 '15 14:10

user3409950


People also ask

What is the difference between joins and includes in rails?

What is the difference between includes and joins? The most important concept to understand when using includes and joins is they both have their optimal use cases. Includes uses eager loading whereas joins uses lazy loading, both of which are powerful but can easily be abused to reduce or overkill performance.

What is join in rails?

You can use Active Record joins to query one model based on data from a related table. For example you can return specific categories based on each category's products. Technical note: joins in Rails runs an SQL 'inner join' operation and returns the records from the model you're operating on.


2 Answers

The first query is bad because it sucks all of the user ids into a Ruby array and then sends them back to the database. If you have a lot of users, that's a huge array and a huge amount of bandwidth, plus 2 roundtrips to the database instead of one. Furthermore, the database has no way to efficiently handle that huge array.

The second and third approaches are both efficient database-driven solutions (one is a subquery, and one is a join), but you need to have the proper index. You need an index on the memberships table on user_id.

add_index :memberships, :user_id

The index that you already have, would only be helpful if you wanted to find all of the users that belong to a particular group.

Update:

If you have a lot of columns and data in your users table, the DISTINCT users.* in the 3rd query is going to be fairly slow because MySQL has to compare a lot of data in order to ensure uniqueness.

To be clear: this is not intrinsic slowness with JOIN, it's slowness with DISTINCT. For example: Here is a way to avoid the DISTINCT and still use a JOIN:

SELECT users.* FROM users
INNER JOIN (SELECT DISTINCT memberships.user_id FROM memberships) AS user_ids
ON user_ids.user_id = users.id;

Given all of that, in this case, I believe the 2nd query is going to be the best approach for you. The 2nd query should be even faster than reported in your original results if you add the above index. Please retry the second approach, if you haven't done so yet since adding the index.

Although the 1st query has some slowness issues of its own, from your comment, it's clear that it is still faster than the 3rd query (at least, for your particular dataset). The trade-offs of these approaches is going to depend on your particular dataset in regards to how many users you have and how many memberships you have. Generally speaking, I believe the 1st approach is still the worst even if it ends up being faster.

Also, please note that the index I'm recommending is particularly designed for the three queries you listed in your question. If you have other kinds of queries against these tables, you may be better served by additional indexes, or possibly multi-column indexes, as @tata mentioned in his/her answer.

like image 156
Nathan Avatar answered Oct 12 '22 15:10

Nathan


The query with join is slow because it loads all columns from database despite of the fact that rails don't preload them this way. If you need preloading then you should use includes (or similar) instead. But includes will be even slower because it will construct objects for all associations. Also you should know that User.where.not(id: Membership.uniq.select(:user_id)) will return empty set in case when there is at least one membership with user_id equal to nil while the query with pluck will return the correct relation.

like image 38
pavlik4k Avatar answered Oct 12 '22 13:10

pavlik4k