Updating a massive number of records -- performance optimization

I have a baseball tool that allows users to analyze a player's historical batting stats. For example, how many hits does A-Rod have over the past 7 days during night-time conditions? I want to expand the timeframe so a user can analyze a player's batting stats as far back as 365 days. However, doing so requires some serious performance optimization. Here is my current set of models:

class AtBat < ActiveRecord::Base
  belongs_to :batter
  belongs_to :pitcher
  belongs_to :weather_condition

  ### DATA MODEL ###
  # id
  # batter_id
  # pitcher_id
  # weather_condition_id
  # hit (boolean)
  ##################
end

class BattingStat < ActiveRecord::Base
  belongs_to :batter
  belongs_to :recordable, :polymorphic => true # e.g., Batter, Pitcher, WeatherCondition

  ### DATA MODEL ###
  # id
  # batter_id
  # recordable_id
  # recordable_type
  # hits7
  # outs7
  # at_bats7
  # batting_avg7
  # ...
  # hits365
  # outs365
  # at_bats365
  # batting_avg365
  ##################
end

class Batter < ActiveRecord::Base
  has_many :batting_stats, :as => :recordable, :dependent => :destroy
  has_many :at_bats, :dependent => :destroy
end

class Pitcher < ActiveRecord::Base
  has_many :batting_stats, :as => :recordable, :dependent => :destroy
  has_many :at_bats, :dependent => :destroy
end

class WeatherCondition < ActiveRecord::Base
  has_many :batting_stats, :as => :recordable, :dependent => :destroy
  has_many :at_bats, :dependent => :destroy
end

For the sake of keeping my question at a reasonable length, let me narrate what I am doing to update the batting_stats table instead of copying a bunch of code. Let's start with 7 days.

  1. Retrieve all the at_bat records over the past 7 days.
  2. Iterate over each at_bat record.
  3. Given an at_bat record, grab the associated batter and weather_condition, find the correct batting_stat record (BattingStat.find_or_create_by_batter_and_recordable(batter, weather_condition)), then update that batting_stat record.
  4. Repeat Step 3 with the batter and the pitcher as the recordables.

Steps 1-4 are repeated for other time periods as well -- 15 days, 30 days, etc.
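
To make the cost concrete, here is roughly what that loop looks like in code. A couple of things here are simplified for illustration: I'm assuming a created_at timestamp on at_bats (the data model above omits the date column), and the stat arithmetic inside the loop is a stand-in for my real calculation.

# Sketch of the daily update loop described in steps 1-4 (Rails 3-era
# dynamic finders). created_at and the stat math are illustrative.
def update_batting_stats(days)
  AtBat.where("created_at >= ?", days.days.ago).find_each do |at_bat|
    batter = at_bat.batter

    # Steps 3-4: one batting_stat row per (batter, recordable) pair.
    [at_bat.weather_condition, at_bat.pitcher, batter].each do |recordable|
      stat = BattingStat.find_or_create_by_batter_id_and_recordable_id_and_recordable_type(
        batter.id, recordable.id, recordable.class.name)

      stat.increment("at_bats#{days}")
      stat.increment(at_bat.hit? ? "hits#{days}" : "outs#{days}")
      stat.send("batting_avg#{days}=",
                stat.send("hits#{days}").to_f / stat.send("at_bats#{days}"))
      stat.save!
    end
  end
end

# Repeated for every time period -- this is where the cost multiplies:
[7, 15, 30].each { |days| update_batting_stats(days) }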

You can imagine how laborious it would be to run a script every day to make these updates if I expanded the time periods from a manageable 7/15/30 to 7/15/30/45/60/90/180/365.

So my question is: how would you approach getting this to run at the highest level of performance?

asked Nov 16 '11 by keruilin

1 Answer

AR isn't really meant to do bulk processing like this. You're probably better off doing your batch updates by dropping into SQL proper and doing an INSERT ... SELECT (or perhaps using a gem that does this for you).
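
For example, something along these lines -- a sketch, not a drop-in solution. It assumes MySQL (the upsert syntax is database-specific), a created_at column on at_bats, and a unique index on (batter_id, recordable_id, recordable_type) so ON DUPLICATE KEY UPDATE has something to match on:

# Aggregate each window in one INSERT ... SELECT per recordable type,
# letting the database do the grouping instead of iterating rows in Ruby.
# MySQL syntax; columns taken from the models shown in the question.
def rebuild_stats(days, recordable_type, fk_column)
  ActiveRecord::Base.connection.execute(<<-SQL)
    INSERT INTO batting_stats
      (batter_id, recordable_id, recordable_type,
       hits#{days}, outs#{days}, at_bats#{days}, batting_avg#{days})
    SELECT batter_id, #{fk_column}, '#{recordable_type}',
           SUM(hit), SUM(1 - hit), COUNT(*), SUM(hit) / COUNT(*)
    FROM at_bats
    WHERE created_at >= NOW() - INTERVAL #{days.to_i} DAY
    GROUP BY batter_id, #{fk_column}
    ON DUPLICATE KEY UPDATE
      hits#{days}        = VALUES(hits#{days}),
      outs#{days}        = VALUES(outs#{days}),
      at_bats#{days}     = VALUES(at_bats#{days}),
      batting_avg#{days} = VALUES(batting_avg#{days})
  SQL
end

[7, 15, 30, 45, 60, 90, 180, 365].each do |days|
  rebuild_stats(days, 'WeatherCondition', 'weather_condition_id')
  rebuild_stats(days, 'Pitcher',          'pitcher_id')
  rebuild_stats(days, 'Batter',           'batter_id')
end

Each statement replaces thousands of per-row Ruby updates with one set-based query, so even all eight windows come out to a couple dozen statements per run.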

answered Nov 07 '22 by phs