I'm having a problem with my Rails application where some random queries take around 5 seconds or longer to finish. Most of the time the queries are very simple (select * from x where id = ?) and the columns involved are even indexed.
Here's some more information about the setup:
I found this out while looking at query performance in AppSignal. Most queries finish in a few milliseconds, but every now and then, still within the same request, there are multiple queries that take 5+ seconds to finish. The odd part is that it ALWAYS takes just over 5 seconds. Here's a screenshot of that in action:
Things I've tried:
I notice this in the application because some pages take a long time to load (one of them calls a function that takes about 1 minute to finish), and somehow this blocks new requests. That is strange to me, because Puma runs 4 workers with 8 threads each, i.e. 32 threads that should be able to handle the other requests.
I ran an EXPLAIN on the query from the screenshot above; this is the output:
Limit (cost=0.28..8.30 rows=1 width=150)
-> Index Scan using index_addresses_on_addressable_id_and_addressable_type on addresses (cost=0.28..8.30 rows=1 width=150)
Index Cond: ((addressable_id = 1) AND ((addressable_type)::text = 'University'::text))
Filter: (deleted_at IS NULL)
Total query runtime: 13 ms
And this is the schema of the addresses table:
# Table name: addresses
#
# id :integer not null, primary key
# street :string
# zip_code :string
# city :string
# country :string
# addressable_id :integer
# addressable_type :string
# created_at :datetime not null
# updated_at :datetime not null
# street_number :string
# latitude :float
# longitude :float
# mobile :string
# phone :string
# email :string
# deleted_at :datetime
# name :string
Here's my Puma config file:
#!/usr/bin/env puma
directory '/home/deployer/apps/qeystate/current'
rackup "/home/deployer/apps/qeystate/current/config.ru"
environment 'staging'
pidfile "/home/deployer/apps/qeystate/shared/tmp/pids/puma.pid"
state_path "/home/deployer/apps/qeystate/shared/tmp/pids/puma.state"
stdout_redirect '/home/deployer/apps/qeystate/shared/log/puma_access.log', '/home/deployer/apps/qeystate/shared/log/puma_error.log', true
threads 4,8
bind 'unix:///home/deployer/apps/qeystate/shared/tmp/sockets/puma.sock'
workers 4
preload_app!
prune_bundler
on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end
end

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
end
I would suggest a few things: two possible solutions, one testing/reproduction approach, and one suggestion for deeper metrics.
1) Possible quick solution: Spin off that 1-minute job into a background worker so it is non-blocking, and see if the problem goes away. Redis + Sidekiq (or something similar) is pretty simple to get up and running; see the sketch just below.
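A minimal sketch of what that could look like with Sidekiq. The worker and ReportGenerator names are made up; substitute whatever your 1-minute call actually is:

# app/workers/slow_report_worker.rb (requires the sidekiq gem and a running Redis)
class SlowReportWorker
  include Sidekiq::Worker

  def perform(university_id)
    university = University.find(university_id)
    # Hypothetical stand-in for your ~1 minute function call
    ReportGenerator.run_for(university)
  end
end

# In the controller, enqueue instead of running it inline:
SlowReportWorker.perform_async(university.id)

That way the web thread returns immediately and the long call can no longer tie up a Puma worker thread while it runs.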
2) Second possible solution: Look for any full-table locks or exclusive row locks being taken in Postgres - check whether you EVER take a full-table lock, and if so, find the offending statement and eliminate it. The snippet below shows one way to inspect this while the slowdown is happening.
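A rough sketch against pg_locks/pg_stat_activity you could run from a Rails console during the 5-second stalls; it joins on relation only, so it won't catch every lock type, but it will show lock requests stuck waiting and the statements holding them up:

# Run in a Rails console while the slowdown is happening
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT waiting.pid        AS blocked_pid,
         waiting_act.query  AS blocked_query,
         holder.pid         AS blocking_pid,
         holder_act.query   AS blocking_query
  FROM pg_locks waiting
  JOIN pg_stat_activity waiting_act ON waiting_act.pid = waiting.pid
  JOIN pg_locks holder
    ON holder.relation = waiting.relation
   AND holder.granted
   AND holder.pid <> waiting.pid
  JOIN pg_stat_activity holder_act ON holder_act.pid = holder.pid
  WHERE NOT waiting.granted
SQL
rows.each { |row| puts row.inspect }

If that ever returns rows mentioning the addresses table, you've found your blocker; also keep an eye out for sessions sitting in 'idle in transaction'.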
3) Testing/replication: See if you can replicate this problem outside of production. I'd recommend JMeter as a very useful tool for simulating a lot of requests of different types, so you can try to reproduce this in a controlled/staging context. Consistent replication is the key to resolving issues like this. Use your production server logs from around the time the issue occurs to build the JMeter test requests that will hopefully reproduce it.
If you can figure out a way to replicate it, you can start tuning the simulation to see whether removing certain requests, or increasing/decreasing their volume, eliminates the problem or changes it in some way. If you'd rather start with a quick script than a full JMeter plan, a sketch like the one below can also surface the blocking behaviour.
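A few threads of plain Net::HTTP are enough for a first look. The host and paths below are assumptions; point them at your staging app, one known-slow action and one normally fast one:

require "net/http"
require "uri"

BASE = URI("http://staging.example.com")  # assumed staging host

def timed_get(path)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Net::HTTP.get_response(BASE.host, path, BASE.port)
  puts format("%-30s %.2fs", path, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
end

slow = Thread.new { timed_get("/slow_report") }            # the ~1 minute action (assumed path)
fast = 8.times.map do
  Thread.new { 5.times { timed_get("/universities/1") } }  # a normally fast page (assumed path)
end
(fast << slow).each(&:join)

If the fast requests suddenly report ~5 seconds while the slow one is in flight, you've reproduced the problem and can start varying the mix.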
4) Analytics: Install NewRelic or a similar APM gem to get deeper insight into what's going on when that request comes in. You really want a solid picture of whether the request is truly blocked in Postgres (by an exclusive row/table lock that is blocking your query), whether it is queued behind a slow-running request in Puma, or whether there is some unfortunate wait state somewhere inside Ruby.
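As a stopgap while you get an APM wired up, you can also have Rails itself flag the slow statements. This is a small, assumed initializer using ActiveSupport::Notifications; it only tells you how long ActiveRecord waited on each query, which is exactly the "is it really Postgres?" question:

# config/initializers/slow_sql_log.rb (hypothetical file name)
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, start, finish, _id, payload|
  duration = finish - start
  if duration > 1.0  # seconds; tune the threshold
    Rails.logger.warn("[SLOW SQL] #{format('%.2f', duration)}s #{payload[:name]}: #{payload[:sql]}")
  end
end

If those 5-second entries show up here, the time really is spent waiting on the database; if they don't, look at the Puma queue or Ruby-side waits instead.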
You don't yet have enough information to solve this issue, so you really want to start by collecting data, alongside hypotheses of what's happening. That is my general strategy for this kind of problem, roughly in the order above.