
PHP5-FPM randomly starts consuming a lot of CPU

I've run into a really strange problem which I am not sure how to debug further. I have an NGINX + PHP5-FPM + APC Amazon Ubuntu instance and there is a website installed on it which is a complex PHP framework. While trying to debug the problem, I've reduced the flow to this: a lot of big classes get included, main objects are created, session is started, array of configs is retrieved from memcached, an XML file is retrieved from memcached, HTML templates are included, output is sent to the client.

Then I use http_load tool to put the website under the load of 20 requests per second: http_load -timeout 10 -rate 20 -fetches 10000 ./urls.txt

What happens next is rather strange. top shows a bunch of spawned php5-fpm processes, each taking a few percent of CPU, and everything runs smoothly, like this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28440 www-data 20 0 67352 10m 5372 S 4.3 1.8 0:20.33 php5-fpm
28431 www-data 20 0 67608 10m 5304 S 3.3 1.8 0:16.77 php5-fpm
28444 www-data 20 0 67352 10m 5372 S 3.3 1.8 0:17.17 php5-fpm
28445 www-data 20 0 67352 10m 5372 S 3.0 1.8 0:16.83 php5-fpm
28422 www-data 20 0 67608 10m 5292 S 2.3 1.8 0:18.99 php5-fpm
28424 www-data 20 0 67352 10m 5368 S 2.0 1.8 0:16.59 php5-fpm
28438 www-data 20 0 67608 10m 5304 S 2.0 1.8 0:17.91 php5-fpm
28439 www-data 20 0 67608 10m 5304 S 2.0 1.8 0:23.34 php5-fpm
28423 www-data 20 0 67608 10m 5292 S 1.7 1.8 0:20.02 php5-fpm
28430 www-data 20 0 67608 10m 5300 S 1.7 1.8 0:15.77 php5-fpm
28433 www-data 20 0 67352 10m 5372 S 1.7 1.8 0:17.08 php5-fpm
28434 www-data 20 0 67608 10m 5292 S 1.7 1.8 0:18.56 php5-fpm
20648 memcache 20 0 51568 8192 708 S 1.3 1.3 2:51.06 memcached
28420 www-data 20 0 69876 13m 6300 S 1.3 2.3 0:20.89 php5-fpm
28421 www-data 20 0 67608 10m 5300 S 1.3 1.8 0:21.19 php5-fpm
28429 www-data 20 0 9524 2260 992 S 1.3 0.4 0:11.68 nginx
28435 www-data 20 0 67608 10m 5304 S 1.3 1.8 0:18.58 php5-fpm
28437 www-data 20 0 67352 10m 5372 S 1.3 1.8 0:17.87 php5-fpm
28441 www-data 20 0 67608 10m 5292 S 1.3 1.8 0:20.75 php5-fpm

Then, after some time that can be anywhere from one second to several minutes, several (usually two) php5-fpm processes suddenly consume all the CPU:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28436 www-data 20 0 67608 10m 5304 R 48.5 1.8 0:23.68 php5-fpm
28548 www-data 20 0 67608 10m 5276 R 45.2 1.7 0:07.62 php5-fpm
28434 www-data 20 0 67608 10m 5292 R 2.0 1.8 0:23.28 php5-fpm
28439 www-data 20 0 67608 10m 5304 R 2.0 1.8 0:26.63 php5-fpm

At this point everything gets stuck and all new HTTP requests time out. If I stop the http_load tool, the php5-fpm processes will hang there for many minutes. Interestingly enough, if I do php5-fpm stop, the php5-fpm processes disappear, but any command that uses the filesystem has trouble executing. E.g., if I try to download a file via ssh, top shows the following, and it takes many minutes before the actual download starts:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3298 sshd 20 0 7032 876 416 R 75.2 0.1 0:04.52 sshd
3297 sshd 20 0 7032 876 416 R 24.9 0.1 0:04.49 sshd

PHP error log usually has this:

[05-Dec-2012 20:31:39] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 58 total children
[05-Dec-2012 20:32:08] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 66 total children

Nginx error log is flooded with these entries:

2012/12/05 20:31:36 [error] 4800#0: *5559 connect() to unix:/dev/shm/php-fpm-www.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: ..., server: ec2-....compute-1.amazonaws.com, request: "GET /usa/index.php?page=contact_us HTTP/1.0", upstream: "fastcgi://unix:/dev/shm/php-fpm-www.sock:", host: "ec2-....compute-1.amazonaws.com"

The PHP-FPM slow log doesn't show anything interesting, swapping never happens, and I didn't manage to gather any other interesting facts about the problem. I've gone through many iterations of config file changes, the most recent ones being:

nginx.conf: http://pastebin.com/uaD56hJF

pool.d/www.conf: http://pastebin.com/mFeeUULC

===UPDATE 1===

site's config: http://pastebin.com/qvinVNhB

===UPDATE 2===

Also, I just found that dmesg reports errors like this:

[6483131.164331] php5-fpm[28687]: segfault at b6ec8ff4 ip b78c3c32 sp bff551f0 error 4 in ld-2.13.so[b78b5000+1c000]

===UPDATE 3===

We've got a new Amazon EC2 micro instance, just in case, to exclude possible hardware issues. I am also using php-fastcgi now, to exclude possible FPM bugs. The other differences are minor; I think the only thing that changed is Ubuntu -> Debian. The same problem still happens, except that now the server manages to recover slightly after max_execution_time seconds (and then spikes again).

I tried playing with a separate test.php, and I am not sure if it's the same issue, but at least in top it looks the same. I created a test.php that includes a bunch of libs belonging to our framework. The libs don't do anything except define classes or include other libs that define classes. I checked with APC, and all of this is successfully served from its cache. Then I started stressing test.php with 200 requests per second, and after some time the same thing happened, except that now I managed to get some "too many open files" errors. It doesn't always happen, though; sometimes it just starts timing out without outputting the error, and a few PHP processes get stuck consuming all the CPU. I've only played with it a little, but I think there is a correlation here: by controlling the number of included libs, or by slightly varying the requests-per-second rate, I can control when the CPU spike happens. I increased the relevant OS limits, but the issue is still there, although it now takes longer to happen (note that I've set the limits to values N times larger than the total number of requests I make during the tests).

fs.file-max = 70000
...
*       soft    nofile   10000
*       hard    nofile  30000
...
worker_rlimit_nofile 10000;
...
(I reloaded all the configs and made sure the new system variables actually took effect.)

So the next best and only explanation I can come up with so far is that even though APC is supposed to serve files from memory, internally it is implemented in a way that still uses a file descriptor whenever a PHP include is called. And either because it releases them with a delay, or because at some unfortunate moment too many requests arrive at once, the system runs out of descriptors and newly arriving HTTP requests quickly stack up into a huge queue. I'll try to test this somehow.
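One way to sanity-check that theory is to watch descriptor usage directly while the load test is running. A rough sketch (Linux-specific, since it reads /proc; adjust the process name if you're on php-fastcgi):

```shell
# System-wide descriptor usage: "allocated  free  max"
cat /proc/sys/fs/file-nr

# Per-worker: current fd count vs. the soft limit the worker actually inherited
for pid in $(pgrep php5-fpm); do
    fds=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
    limit=$(awk '/Max open files/ {print $4}' /proc/"$pid"/limits)
    echo "pid $pid: $fds fds open (soft limit $limit)"
done
```

If the per-worker fd counts climb toward their soft limit right before a spike, the descriptor theory looks plausible; if they stay flat, the bottleneck is elsewhere.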

asked Dec 05 '12 by Eugene



4 Answers

I've run a website with a similar configuration for many months, with zero downtime. I've had a look at your config, and it looks OK. That being said, I did my config quite a while ago.

I would consider reducing pm.max_requests = 10000 to something more reasonable like pm.max_requests = 500. This just means "don't use each worker for more than X requests before respawning it". Keeping this number fairly low gives you resilience against possible PHP engine bugs, such as slow memory leaks.

I think the real problem is most likely in your PHP scripts. It's hard to say without knowing more.

EDIT: Consider uncommenting ;request_terminate_timeout = 0 and setting it to something like request_terminate_timeout = 20. Your scripts will then be required to complete within 20 seconds. You will most likely see a change in behaviour, but I think your site might stay live. That would indicate a PHP script error.

EDIT2: My own php-fpm config is as follows:

[example.com]
listen = /var/run/sockets/example.com.socket
user = www-data
group = www-data
pm = dynamic
pm.start_servers = 5
pm.max_children = 15
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 500
php_flag[expose_php] = off
php_flag[short_open_tag] = on

EDIT3: I spotted something unexpected in your nginx config, but it may be nothing.

You are using fastcgi_ignore_client_abort on; which causes problems in worker processes under older versions of nginx. I haven't seen this problem myself, since I am running a custom compile of a recent version. Here's the description of the problem on the nginx site:

In 1.0.2 POST requests are not handled correctly when fastcgi_ignore_client_abort is set to on which can lead to workers processes segfaulting. Switching fastcgi_ignore_client_abort back to default (off) should resolve this issue.
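If you're on an affected version, the fix is a one-line change in the location block that passes requests to FastCGI (this simply restores the directive's default):

```nginx
fastcgi_ignore_client_abort off;  # the default; avoids the 1.0.2-era worker segfaults
```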

answered Oct 30 '22 by Kevin A. Naudé


A simple but very useful trick that can cut processor usage by up to 50%: edit your php-fpm pool config, find:

pm = dynamic

and change it to:

pm = ondemand
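For completeness, ondemand is usually paired with a couple of companion settings (the values below are illustrative; tune pm.max_children to your instance's RAM):

```ini
pm = ondemand
pm.max_children = 15          ; hard cap on simultaneous workers
pm.process_idle_timeout = 10s ; kill workers idle longer than this
pm.max_requests = 500         ; recycle each worker periodically
```

With ondemand, workers are only forked when requests arrive and are reaped once idle, so a mostly-quiet server doesn't keep a pool of spare processes around.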
answered Oct 29 '22 by Animanga


The behavior of PHP-FPM on my server was the same as yours; there was a bottleneck somewhere for sure.
The question then becomes: how do you find the bottleneck in an nginx + PHP-FPM + MySQL stack? The fastest way is to enable the slow log for PHP-FPM.
Add the lines below to your php-fpm pool config, and make sure the log path exists:

request_slowlog_timeout = 10
slowlog = /var/log/php-fpm/slow.$pool.log

By reading the log backtraces, you can find out why PHP-FPM spent so much CPU or timed out. Here is my case:

[28-Dec-2018 14:56:55]  [pool laravel] pid 19061
script_filename = /public_html/index.php
[0x00007efdda4d8100] hasChildren() /public_html/laravel/vendor/symfony/finder/Iterator/ExcludeDirectoryFilterIterator.php:75
[0x00007ffe31cd9e40] hasChildren() unknown:0
[0x00007ffe31cda200] next() unknown:0
[0x00007ffe31cda540] next() unknown:0
[0x00007ffe31cda880] next() unknown:0
[0x00007efdda4d7fa8] gc() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/FileSessionHandler.php:91
[0x00007efdda4d7e50] gc() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/Middleware.php:159
[0x00007efdda4d7d48] collectGarbage() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/Middleware.php:128
[0x00007efdda4d7c20] closeSession() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/Middleware.php:79
[0x00007efdda4d7ac8] handle() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Cookie/Queue.php:47
[0x00007efdda4d7930] handle() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Cookie/Guard.php:51
[0x00007efdda4d7818] handle() /public_html/laravel/vendor/stack/builder/src/Stack/StackedHttpKernel.php:23
[0x00007efdda4d76e0] handle() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Foundation/Application.php:641
[0x00007efdda4d7598] run() 
/public_html/index.php:51

The backtrace mentions about these keywords:

"cookie" "session" "collectGarbage()" "laravel"

I kept searching and, TADA: Laravel uses a RANDOM (lottery) method to clear expired sessions, and in my configuration PHP stores sessions on an SSD.
When the number of session files becomes very big, this makes PHP spend more time handling them => high CPU usage.
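For reference, that lottery is configurable. In recent Laravel versions the setting lives in config/session.php (in the Laravel 4 era it was app/config/session.php); the values below are the framework's defaults:

```php
// config/session.php -- out of every 100 requests, ~2 will trigger
// session garbage collection; lowering the odds (e.g. [1, 1000])
// reduces how often a request pays the GC cost
'lottery' => [2, 100],
```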

There can be many kinds of bottleneck; we only know which one once we've debugged it.

Have a nice investigation.

answered Oct 29 '22 by minhhq


I'm going through this same problem right now, and wanted to point you to this post:

How to determine which script is being executed in PHP-FPM process

It's got to be one of your PHP scripts. See if you can connect the dots between the runaway process IDs and the .php script file that's holding you up.
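A quick way to connect those dots from the shell (a rough sketch; Linux-only since it reads /proc, and it assumes the workers are still named php5-fpm):

```shell
# For each php5-fpm worker, show the .php files it currently holds open
for pid in $(pgrep php5-fpm); do
    echo "== worker $pid =="
    ls -l /proc/"$pid"/fd 2>/dev/null | grep '\.php' || echo "  (no .php files open)"
done
```

If you'd rather ask FPM directly, enabling pm.status_path in the pool config gives you a status page that, with the full view, reports the script each worker is currently executing.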

Funny, this happened on a server that had been impeccably fast. I think a WordPress upgrade (plugin or core) may well be responsible.

answered Oct 29 '22 by Berto