Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

very strange behaviour with ruby, openssl, unicorn, systemd (Gcloud)

We started seeing some strange errors in our logs that normally appear when ruby isn't compiled properly with OpenSSL. But it's inconcistent...

We're getting errors like:

  • RuntimeError: Unsupported digest algorithm (SHA256). (also with other digests, like sha1). example error trace
  • Faraday::SSLError (SSL_CTX_new: (null)) example error trace

We managed to reproduce it when starting unicorn using service unicorn start or systemctl start unicorn. But only with some requests... Not all of them. Some requests that use OpenSSL under the hood do work. Others don't.

However, when we start unicorn with /etc/init.d/unicorn start, everything works without a hitch. (to clarify, systemd starts the same /etc/init.d script)

We tried debugging ENV vars, user permissions, file/dir ownership, recompile ruby, bootstrap a new server from scratch... Nothing seems to help.

In case this helps:

  • unicorn init.d script
  • unicorn.rb

What are we missing? What can we try that we haven't thought of?

UPDATE 1

  • output of some debug commands, e.g. OpenSSL, ruby etc
  • PATH is being set inside the init.d script
  • unicorn is being executed via su into www-data user
  • The same problem happens when we use this unicorn.service file in /etc/systemd/system
  • We're running Ubuntu 16.04 on Gcloud
  • Ruby was not installed via apt (explicitly removed, in case platform came pre-installed) and compiled from scratch. We're currently running 2.3.4 and tried also 2.3.6. Compiled either manually or using ruby-build. No rbenv, nor RVM.
  • We install libssl-dev via apt (we're running apt-get install -y autoconf bison build-essential libssl-dev libyaml-dev libreadline6-dev zlib1g-dev libncurses5-dev libffi-dev libgdbm3 libgdbm-dev before building ruby)

UPDATE 2

We're using a scripted/repeatable build process for the VM (using fabric), and this problem is consistent on multiple VMs we bootstrapped on GCloud. We then tried a VM on DigitalOcean with the same bootstrap scripts, and the problem doesn't seem to appear there.

In both cases we picked Ubuntu 16.04 64bit base image, but obviously there are some differences with kernel versions, base installed packages etc...

UPDATE 3

The problem simply vanished. See my answer below.

like image 381
gingerlime Avatar asked Feb 28 '18 18:02

gingerlime


2 Answers

@gingerlime I had a similar situation with our Jenkins on GCP, we're using ChefDK 3.1.0 (ruby embeed 2.5.1p57) -- tried other also, over a Jenkins that was running over systemd (Ubuntu 16.04) and upstart (Ubuntu 14.04) -- we tried on both versions, right now running over 16.04 in 4.15.0-1023-gcp kernel version, running a few jobs with kitchen-docker and this problem always emerge in a few situations.

I digged into and found that this only happens when the Etc.getlogin class gets called (for me here), this doesn't return any error, it return the correct info, the correct type of the class (String), but once it gets a call, the Unsupported digest algorithm gets raised.

If I start the process manually by root or jenkins user, this problem doesn't happen. I tried to implement the Etc.getlogin in several different ways, like using ENV['USER'], a fixed String, or other classes from Etc, like getpwuid, simulating the return class and values from Etc.getlogin, and the error doesn't get raised.

I'm not sure if this is some bug related to the ruby version and the custom kernel that GCP instances uses, but it happens in a similar situation like yours, and for me, the Etc.getlogin was the problem. Right now, I fixed by using a custom configuration that doesn't gets the call from this function, and it's working normally.

like image 53
malucelli Avatar answered Sep 26 '22 05:09

malucelli


One option is that this isn't an issue of sysVinit vs systemd at all, but you just haven't triggered the issue with your sysVinit script yet.

When you run your svsVinit script through the systemctl command it's going through a compatibility layer, and there may be a problem there. Your problem would be simplified both yourself and for us if you reproduced the issue directly with a systemd service file and shared that file.

You mentioned debugging ENV, but didn't mention exactly what you checked in the ENV. This is definitely one place where systemd could make a difference. As seen in man systemd.exec, systemd sets $PATH in the environment to a fixed value:

/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

If this is not exactly the same as when run directly as an sysVinit script, that could be an issue.

I would also check for all your copies of SSL on the system. Do you have more than one? Where? Do you have more than copy of the ruby openssl module loaded?

 locate -r lib/.*libssl.*so

Also see the answer to the FAQ: Why do things behave differently under systemd?

like image 43
Mark Stosberg Avatar answered Sep 23 '22 05:09

Mark Stosberg