
Rails query execution causes database spikes

I'm having a problem with my Rails application where some random queries take around 5 seconds or longer to finish. Most of the time these queries are very simple (select * from x where id = ?) and the relevant columns are indexed.

Here's some more information about the setup:

  • Puma 3.5.0 behind an nginx reverse proxy
    • 4 workers with minimum 4, max 8 threads each.
  • Ruby v2.2.3, Rails v4.2.4
  • PostgreSQL 9.4 database
    • Database connection pool set to a max of 60 connections
  • Appsignal for monitoring
  • 8GB RAM, 4 CPUs, SSD.

I found this out when looking at the query performance in Appsignal. I noticed most queries finishing in a few ms, and then every now and then, still in the same request, there are multiple queries that take 5+ seconds to finish. The odd part is that it ALWAYS takes just over 5 seconds. Here's a screenshot of that in action: Appsignal performance

Things I've tried:

  • Increase the thread pool to make sure the puma worker threads have enough connection objects.
  • Set 'reaping_frequency' to 10s to make sure there are no dead connections being used.
  • Increase puma workers/threads

I'm noticing this in the application as well: some pages take a long time to load (one of them includes a function call that takes about 1 minute to finish), and somehow this blocks new requests. This is strange to me, as 4 workers with 8 threads each = 32 threads that should be able to handle the other requests.
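For reference, the numbers above work out as follows (a quick sketch, using only figures from the question):

```ruby
# Back-of-the-envelope capacity math for the setup described in the
# question (4 Puma workers, up to 8 threads each).
workers            = 4
threads_per_worker = 8

request_threads = workers * threads_per_worker
puts request_threads        # 32 threads serving requests in total

# ActiveRecord's connection pool is per *process*: each Puma worker only
# needs as many connections as it runs threads, so a pool of 8 per worker
# (32 connections total) already covers this setup; 60 is more than enough.
pool_needed_per_worker = threads_per_worker
puts pool_needed_per_worker # 8
```

So raw thread capacity shouldn't be the bottleneck here, which makes the blocking behaviour all the more surprising.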

I ran an explain on the query in the picture above, this is the output:

Limit  (cost=0.28..8.30 rows=1 width=150)
  ->  Index Scan using index_addresses_on_addressable_id_and_addressable_type on addresses  (cost=0.28..8.30 rows=1 width=150)
        Index Cond: ((addressable_id = 1) AND ((addressable_type)::text = 'University'::text))
        Filter: (deleted_at IS NULL)
Total query runtime: 13 ms
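Note that the plan above only shows cost estimates plus a single client-side timing. A possibly more telling variant (the query is reconstructed from the index condition and filter in the plan, so it is an assumption about the exact SQL) would be EXPLAIN (ANALYZE, BUFFERS), which is available on PostgreSQL 9.4 and reports actual execution times per node:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM addresses
WHERE addressable_id = 1
  AND addressable_type = 'University'
  AND deleted_at IS NULL
LIMIT 1;
```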

And this is the schema of the addresses table:

# Table name: addresses
#
#  id               :integer          not null, primary key
#  street           :string
#  zip_code         :string
#  city             :string
#  country          :string
#  addressable_id   :integer
#  addressable_type :string
#  created_at       :datetime         not null
#  updated_at       :datetime         not null
#  street_number    :string
#  latitude         :float
#  longitude        :float
#  mobile           :string
#  phone            :string
#  email            :string
#  deleted_at       :datetime
#  name             :string

Here's my Puma config file:

#!/usr/bin/env puma

directory '/home/deployer/apps/qeystate/current'
rackup "/home/deployer/apps/qeystate/current/config.ru"
environment 'staging'   
pidfile "/home/deployer/apps/qeystate/shared/tmp/pids/puma.pid"
state_path "/home/deployer/apps/qeystate/shared/tmp/pids/puma.state"
stdout_redirect '/home/deployer/apps/qeystate/shared/log/puma_access.log', '/home/deployer/apps/qeystate/shared/log/puma_error.log', true
threads 4,8
bind 'unix:///home/deployer/apps/qeystate/shared/tmp/sockets/puma.sock'
workers 4
preload_app!
prune_bundler

on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end
end

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
end
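For completeness, the matching ActiveRecord pool lives in config/database.yml and is sized per process, not per cluster. A sketch of what that might look like for this setup (the file itself isn't in the question, so the values are illustrative):

```yaml
# config/database.yml (staging) -- illustrative values, not the actual file
staging:
  adapter: postgresql
  pool: 8                # >= Puma's max threads per worker is sufficient
  reaping_frequency: 10  # as already set in the question
```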
asked Aug 03 '16 by Ruben Homs
1 Answer

I would suggest a couple of things: two possible solutions, one testing/reproduction approach, and one suggestion for deeper metrics.

1) Possible quick solution: Spin off that 1 minute job so it is non-blocking, and see if the problem resolves. Try Redis + Sidekiq, which is pretty simple to get up and running (or something similar).
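The idea can be sketched with nothing but the Ruby standard library. This is not the app's code and not Sidekiq itself, just a minimal Queue/Thread illustration of why offloading helps: the slow work runs on a background worker, so the request thread returns immediately instead of holding a Puma thread for a minute.

```ruby
# Hypothetical sketch: a background worker consuming jobs off a queue,
# standing in for what Redis + Sidekiq does across processes.
jobs    = Queue.new
results = Queue.new

worker = Thread.new do
  while (job = jobs.pop) != :shutdown
    results << job.call  # the slow work runs here, off the request path
  end
end

# The "request" only enqueues and moves on; it does not wait the full minute.
# (A real request wouldn't join the worker; we do it here only to observe
# the result in this self-contained sketch.)
jobs << -> { sleep 0.05; :report_generated }  # stand-in for the 1-minute call
jobs << :shutdown

worker.join
outcome = results.pop
puts outcome  # report_generated
```

With Sidekiq the queue lives in Redis and the worker is a separate process, so the long job can't tie up any of Puma's request threads at all.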

2) Second possible solution: Look for any full table locks or exclusive row locks being taken in Postgres. See if you EVER take a full table lock, and if so, find the offending statement and eliminate it.
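One way to check this is to query the standard pg_locks and pg_stat_activity catalogs while a slow request is in flight. A query along these lines (a sketch of the common lock-monitoring pattern; adjust to taste) shows which sessions are waiting and which session holds the conflicting lock:

```sql
-- Show blocked sessions and the session blocking them (PostgreSQL 9.2+).
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_locks bl
JOIN pg_stat_activity blocked  ON blocked.pid = bl.pid
JOIN pg_locks kl ON kl.locktype = bl.locktype
                AND kl.database IS NOT DISTINCT FROM bl.database
                AND kl.relation IS NOT DISTINCT FROM bl.relation
                AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid
                AND kl.pid <> bl.pid
JOIN pg_stat_activity blocking ON blocking.pid = kl.pid
WHERE NOT bl.granted;
```

If this ever returns rows during one of the 5-second spikes, you've found your blocker.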

3) Testing/replication: See if you can replicate this problem outside of production. I'd recommend jmeter as a very useful tool for simulating lots of requests of different types, so you can try to reproduce this in a controlled/staging context. Consistent replication is the key to resolving issues like this. Refer to your production server logs from around the time the issue occurs to build the jmeter test requests that will hopefully reproduce it.

If you can figure out a way to replicate it, then you can start tuning the simulation to see if removing or increasing/decreasing various requests eliminates the problem or changes it in some way.

4) Analytics: Install NewRelic or a similar analytics gem to get deeper insight into what's going on when that request comes in. You really want a solid picture of whether the request is truly blocked in Postgres (by an exclusive row/table lock blocking your query), backed up behind a slow-running request in Puma's execution queue, or stuck in some unfortunate wait state inside Ruby itself.

You don't yet have enough information to solve this issue, so you really want to start by collecting data, alongside hypotheses of what's happening.

My general strategy for this kind of problem is (in this order):

  • Try some quick/easy/safe fixes (in the blind) and see if anything is resolved/changed.
  • Try to replicate in a non-production environment (really, really try to make this work).
  • Instrument the system to collect data to see what you can learn about the problem and anything related.
answered Nov 05 '22 by Steve Midgley