Gearman with multiple servers and php workers

Tags: php, gearman
I'm having a problem with Gearman workers running on multiple servers that I can't seem to solve.

The problem occurs when a worker server is taken offline (as opposed to the worker process merely being cancelled): all the other worker processes then error out and fail.

Example with just 1 client and 2 workers:

Client:

$client = new GearmanClient ();

$client->addServer ('192.168.1.200');
$client->addServer ('192.168.1.201');

$job = $client->do ('generate_tile', serialize ($arrData));

Worker:

$worker = new GearmanWorker ();

$worker->addServer ('192.168.1.200');
$worker->addServer ('192.168.1.201');

$worker->addFunction ('generate_tile', 'generate_tile');

while (1)
{
    if (!$worker->work ())
    {

        switch ($worker->returnCode ())
        {

            default:
                echo "Error: " . $worker->returnCode () . ': ' . $worker->error () . "\n";
                break;

        }

    }
}

function generate_tile ($job) { ... }

The worker code runs on 2 separate servers. When every server is up and running, both workers execute jobs as expected. When one of the worker processes is cancelled, the other worker executes all jobs as expected.

However, when the server with the cancelled worker process is shut down and taken completely offline, requests to the client script hang and the remaining worker process does not pick up any jobs.

I get the following set of errors from the remaining worker process:

Error: 46: gearman_con_wait:timeout reached
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:110
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
....

When I start up the other server, without starting the worker process on it, the remaining worker process immediately springs into life and executes any remaining jobs.

It seems clear to me that I need some code in the worker process to cope with any servers that may be offline; however, I cannot see how to do this.

Many thanks,

Andy

asked Aug 16 '11 09:08 by Andy Burton


3 Answers

Our tests with multiple Gearman servers show that if the last server in the list (192.168.1.201 in your case) is taken down, the workers stop executing in the way you describe. (Also, the workers grab jobs from the last server first; they process jobs on .200 only if there are no jobs on .201.)

This seems to be a bug in the linked list in the Gearman server. It has been reported as fixed multiple times, but the bug persists in every available version of Gearman. Sorry, I know that's not a solution, but we had the same problem and didn't find one. (If someone can provide a working solution for this problem, I'm willing to offer a large bounty.)
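
One workaround that might be worth trying, purely as an untested sketch against this bug: probe each job server with a plain TCP connect before handing it to the worker, so that an offline host never ends up in the worker's server list. The helper names `reachable_servers` and `build_worker` are made up for illustration, 4730 is assumed to be the gearmand port, and the worker would still need to be rebuilt (or the process restarted) if a server goes away later.

```php
<?php
// Hypothetical helper: return only the hosts that accept a TCP connection
// on $port within $timeout seconds.
function reachable_servers(array $hosts, int $port = 4730, float $timeout = 1.0): array
{
    $alive = [];
    foreach ($hosts as $host) {
        $errno = 0;
        $errstr = '';
        // @ hides the warning fsockopen() raises when the connect fails.
        $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
        if ($socket !== false) {
            fclose($socket);
            $alive[] = $host;
        }
    }
    return $alive;
}

// Hypothetical helper: build a worker that only knows about servers that
// answered the probe. Requires the PECL gearman extension when called.
function build_worker(array $hosts, int $port = 4730): GearmanWorker
{
    $alive = reachable_servers($hosts, $port);
    if ($alive === []) {
        throw new RuntimeException('No Gearman job server is reachable');
    }
    $worker = new GearmanWorker();
    foreach ($alive as $host) {
        $worker->addServer($host, $port);
    }
    return $worker;
}
```

With the addresses from the question this would be called as `build_worker(array('192.168.1.200', '192.168.1.201'))`. The probe adds up to one timeout per offline host at start-up, which seems an acceptable price compared with a worker that stops picking up jobs.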

answered Nov 01 '22 00:11 by Maxim Krizhanovsky


Further to @Darhazer's comment above: we found that as well, and solved it like this:

// Gearman workers show a strong preference for servers at the end of a list so randomize the order
$worker = new GearmanWorker();
$s2 = explode(",", Configure::read('workers.servers'));
shuffle($s2);
$servers = implode(",", $s2);
$worker->addServers($servers); 

We run 6 to 10 workers at any time, and expire them after they've completed x requests.
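
For illustration, the shuffle and the expire-after-x-requests idea could be combined roughly as below. This is only a sketch: the function names `shuffled_servers` and `run_worker` and the default of 500 jobs are assumptions, and it presumes something external (cron, supervisord or similar) starts a fresh process once the function returns.

```php
<?php
// Return a comma-separated server list in random order, in the
// "host:port,host:port" form that GearmanWorker::addServers() accepts.
function shuffled_servers(string $servers): string
{
    // Trim whitespace around entries and drop empty ones.
    $list = array_filter(array_map('trim', explode(',', $servers)), 'strlen');
    shuffle($list);
    return implode(',', $list);
}

// Hypothetical runner: handle at most $maxJobs jobs, then return the number
// of jobs completed. It also returns early if work() reports a failure.
// Requires the PECL gearman extension when called.
function run_worker(string $servers, string $function, callable $handler, int $maxJobs = 500): int
{
    $worker = new GearmanWorker();
    $worker->addServers(shuffled_servers($servers));
    $worker->addFunction($function, $handler);

    $done = 0;
    while ($done < $maxJobs && $worker->work()) {
        if ($worker->returnCode() === GEARMAN_SUCCESS) {
            $done++;
        }
    }
    return $done;
}
```

Because every process shuffles independently, a pool of 6 to 10 workers ends up spread across the servers rather than all favouring the last one in the list.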

answered Oct 31 '22 22:10 by Richard


I use this class, which keeps track of which jobs work on which servers. It hasn't been thoroughly tested; I just wrote it now. I've pasted an edited version, so there might be a typo or some such, but otherwise it appears to solve the issue.

<?php
class MyGearmanClient {
        static $server = "server1,server2,server3";
        static $server_array = false;
        static $workingServers = false;
        static $gmclient = false;
        static $timeout = 5000;
        static $defaultTimeout = 5000;

        static function randomServer() {
                return self::$server_array[rand(0, count(self::$server_array) -1)];
        }

        static function getServer($job = false) {
                if (self::$server_array == false) {
                        self::$server_array = explode(",", self::$server);
                        self::$workingServers = array();
                }

                $serverList = array();
                if ($job) {
                        if (array_key_exists($job, self::$workingServers)) {
                                foreach (self::$server_array as $server) {
                                        if (array_key_exists($server, self::$workingServers[$job])) {
                                                if (self::$workingServers[$job][$server]) {
                                                        $serverList[] = $server;
                                                }
                                        } else {
                                                $serverList[] = $server;
                                        }
                                }
                                if (count($serverList) == 0) {
                                        # All servers have failed, need to insert all the servers again and retry.
                                        $serverList = self::$workingServers[$job] = self::$server_array;
                                }
                                return $serverList[rand(0, count($serverList) - 1)];
                        } else {
                                return self::randomServer();
                        }
                } else {
                        return self::randomServer();
                }
        }

        static function serverWorked($server, $job) {
                self::$workingServers[$job][$server] = $server;
        }

        static function serverFailed($server, $job) {
                self::$workingServers[$job][$server] = false;
        }

        static function Connect($server = false, $job = false) {
                if ($server) {
                        self::$server = self::getServer();
                }

                self::$gmclient= new GearmanClient();
                self::$gmclient->setTimeout(self::$timeout);

                # add the default job server
                self::$gmclient->addServer($server = self::getServer($job));

                return $server;
        }

        static function Destroy() {
                self::$gmclient = false;
        }

        static function Client($name, $vars, $timeout = false) {
                if (is_int($timeout)) {
                        self::$timeout = $timeout;
                } else {
                        self::$timeout = self::$defaultTimeout;
                }


                do {
                        # Initialise so the switch below never reads an undefined variable.
                        $error_message = '';
                        $server = self::Connect(false, $name);
                        $value = self::$gmclient->do($name, $vars);
                        $return_code = self::$gmclient->returnCode();
                        if (!$value) {
                                $error_message = self::$gmclient->error();
                                # 47 is GEARMAN_TIMEOUT in recent libgearman releases;
                                # older releases number it differently.
                                if ($return_code == 47) {
                                        # Remember that this server failed for this job.
                                        self::serverFailed($server, $name);
                                        if (count(self::$server_array) > 1) {
                                             # Timeout on $server: try another server.
                                             continue;
                                        } else {
                                             # Only one server configured: give up rather than loop forever.
                                             return false;
                                        }
                                }
                                echo "ERR: $error_message ($return_code)\n";
                        }
                        # printf("Worker has returned\n");
                        $short_value = substr($value, 0, 80);
                        switch ($return_code)
                        {
                        case GEARMAN_WORK_DATA:
                                echo "DATA: $short_value\n";
                                break;
                        case GEARMAN_SUCCESS:
                                self::serverWorked($server, $name);
                                break;
                        case GEARMAN_WORK_STATUS:
                                list($numerator, $denominator)= self::$gmclient->doStatus();
                                echo "Status: $numerator/$denominator\n";
                                break;
                        case GEARMAN_TIMEOUT:
                                // self::Connect();
                                // Fall through
                        default:
                                echo "ERR: $error_message " . self::$gmclient->error() . " ($return_code)\n";
                                break;
                        }
                }
                while($return_code != GEARMAN_SUCCESS);

                $rv = unserialize($value);
                return $rv["rv"];
        }
}

# Example usage:
#    $rv = MyGearmanClient::Client("Function", $args);

?>

answered Oct 31 '22 23:10 by Orwellophile