I'm using Rails 4.0.0 and Ruby 2.0.0. My Post
(as in blog posts) model is associated with a user with a combination of the user's user_name, first_name, last_name. I'd like to migrate the data so that posts are associated to users by a foreign key, which is the user's id.
I have about 11 million records in the posts
table.
I'm running the below code to migrate the data, using a rake task on a Linux server. However, my task keeps getting "Killed" by the sever, presumably due to the rake task, specifically the below code, consuming too much memory.
I've found that lowering the batch_size
to 20 and increasing sleep(10)
to sleep(60)
allows the task to run longer, updating more records in total without being Killed, but takes significantly more time.
How can I optimize this code for speed and memory usage?
Post.where(user_id: nil).find_in_batches(batch_size: 1000) do |posts|
puts "*** Updating batch beginning with post #{posts.first.id}..."
sleep(10) # Hopefully, saving some memory usage.
posts.each do |post|
begin
user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name)
post.update(user_id: user.id)
rescue NoMethodError => error # user could be nil, so user.id will raise a NoMethodError
puts "No user found."
end
end
puts "*** Finished batch."
end
Active Record allows you to validate the state of a model before it gets written into the database. There are several methods that you can use to check your models and validate that an attribute value is not empty, is unique and not already in the database, follows a specific format, and many more.
Returns a new relation, which is the result of filtering the current relation according to the conditions in the arguments.
Do all the work in the database which is WAY faster than moving data back and forth.
This can be accomplished with ActiveRecord. Of course PLEASE test this before you unleash it on important data.
Post
.where(user_id: nil)
.joins("inner join users on posts.user_name = users.user_name")
.update_all("posts.user_id = users.id")
Further, if posts have an index on user_id
, and users has an index on user_name
, then that will help this particular query run more quickly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With