
Which socket accept() errors are fatal?

I am writing a TCP server program in C++ using Boost.Asio, which uses POSIX sockets under the hood in Linux. I have everything working with a loop that continually accepts new connections and spawns server-side sessions with sockets that are initialized upon the successful completion of ip::tcp::acceptor::async_accept (which is a wrapper around the POSIX accept function for you C programmers out there).

Of course, the async_accept operation can emit an error code. Upon receiving such error codes, I log them, and continue the accept loop indefinitely until the program is terminated.

I would like to classify the errors in two categories:

  1. "Non-fatal" errors that are entirely due to the client (e.g. disconnecting, protocol violation). When these occur, the program should continue listening for new client connections. If the program doesn't continue in these cases, then one misbehaving client can effectively DoS the whole system.
  2. "Fatal" errors that are due to programming error or bad server configuration. When these occur, the program should break out of the listening loop and terminate.

How can I tell if an error code belongs to either of my so-called "fatal" and "non-fatal" categories? Is there an open-source application/library out there that already figured this out, so that I can peruse it for inspiration? It doesn't need to use Boost.Asio; something that directly uses (or wraps) POSIX sockets would be useful as well.

What makes matters more complicated is that I'd like my program to be portable, but Linux would be my primary target.


ADDENDUM: This question also applies to C programs that use the POSIX API directly. Boost.Asio is a low-level wrapper around the POSIX API and passes most underlying POSIX error codes intact.

ADDENDUM2: I was asked in the comments to provide example code. Here is the Boost.Asio TCP echo server example modified to break the listening loop upon what I call a "fatal" error. The server::is_fatal_error function is where I need help. Note that I know the mechanics of comparing boost::system::error_code to error conditions. It's the list of actual error conditions that I'm not sure of.

//
// async_tcp_echo_server.cpp
// ~~~~~~~~~~~~~~~~~~~~~~~~~
//
// Copyright (c) 2003-2023 Christopher M. Kohlhoff (chris at kohlhoff dot com)
//
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
//
// *** Modified by Emile Cormier for discussion purposes ***

#include <cstdlib>
#include <iostream>
#include <memory>
#include <system_error>
#include <utility>
#include <boost/asio.hpp>

using boost::asio::ip::tcp;

class session
  : public std::enable_shared_from_this<session>
{
public:
  session(tcp::socket socket) : socket_(std::move(socket)) {}

  void start() {do_read();}

private:
  void do_read()
  {
    auto self(shared_from_this());
    socket_.async_read_some(boost::asio::buffer(data_, max_length),
        [this, self](boost::system::error_code ec, std::size_t length)
        {
          if (!ec)
            do_write(length);
        });
  }

  void do_write(std::size_t length)
  {
    auto self(shared_from_this());
    boost::asio::async_write(socket_, boost::asio::buffer(data_, length),
        [this, self](boost::system::error_code ec, std::size_t /*length*/)
        {
          if (!ec)
            do_read();
        });
  }

  tcp::socket socket_;
  enum { max_length = 1024 };
  char data_[max_length];
};

class server
{
public:
  server(boost::asio::io_context& io_context, short port)
    : acceptor_(io_context, tcp::endpoint(tcp::v4(), port))
  {
    do_accept();
  }

private:
  static bool is_fatal_error(boost::system::error_code ec)
  {
    // *** How do I classify the error as fatal so that it bails out ***
    // *** of the listen loop? ***
    return false; // Return something so that it compiles.
  }
    
  void do_accept()
  {
    acceptor_.async_accept(
        [this](boost::system::error_code ec, tcp::socket socket)
        {
          if (!ec)
          {
            std::make_shared<session>(std::move(socket))->start();
          }
          else if (is_fatal_error(ec))
          {
            // Break out of listening loop if error is due to problem
            // on server side (e.g. misconfiguration)
            std::cerr << "Fatal error: " << ec.message() << std::endl;
            throw std::system_error{ec};
          }

          do_accept();
        });
  }

  tcp::acceptor acceptor_;
};

int main(int argc, char* argv[])
{
  try
  {
    if (argc != 2)
    {
      std::cerr << "Usage: async_tcp_echo_server <port>\n";
      return 1;
    }

    boost::asio::io_context io_context;
    server s(io_context, std::atoi(argv[1]));
    io_context.run();
  }
  catch (std::exception& e)
  {
    std::cerr << "Exception: " << e.what() << "\n";
  }
  return 0;
}

ADDENDUM 3: One of the comments made me realize that there's another category of errors I need to consider: when the accept function fails due to the server being overloaded (which is not the client's fault). I should not break out of the listening loop in those circumstances either.

Emile Cormier asked Jun 26 '26 09:06


1 Answer

I've studied the NGINX ngx_event_accept function to understand how it handles errors from the accept() system call. Here is a summary of my findings (please correct me if I misunderstood something).

  • EAGAIN: This is for non-blocking mode and doesn't apply to me when using Boost.Asio's asynchronous API. NGINX does nothing and waits for the next "cycle" to try again.
  • ECONNABORTED: This occurs when the client disconnects while in the accept queue. NGINX logs this at "error" severity and proceeds to the next socket waiting to be accepted.
  • EMFILE and ENFILE: The per-process or system-wide limit on file descriptors has been reached. If I understood the code correctly, NGINX waits for 500 milliseconds before attempting to perform the accept operation again. This is logged at "critical" severity. The delay is presumably to give time for some TCP connections to close before listening again.
  • All other errors: They are logged at "alert" severity and NGINX will attempt to accept again without terminating. I'm guessing this would result in the error log getting flooded if it's not an intermittent problem, but I haven't found anyone mentioning this happening in my internet searches.
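The four cases above can be sketched as a small decision function. This is only an illustration of the summary, not NGINX's actual code: the names AcceptAction, AcceptDecision, and nginxStyleDecision are hypothetical, and the 500 ms value is the delay described above.

```cpp
#include <cerrno>
#include <chrono>

// Hypothetical names; this mirrors the summary above, not NGINX's source.
enum class AcceptAction
{
    wait_for_event,    // EAGAIN: nothing to accept, wait for the next cycle
    accept_next,       // ECONNABORTED: log it, move on to the next connection
    retry_after_delay, // EMFILE/ENFILE: pause before accepting again
    log_and_retry      // anything else: log it and keep accepting
};

struct AcceptDecision
{
    AcceptAction action;
    std::chrono::milliseconds delay; // non-zero only for retry_after_delay
};

inline AcceptDecision nginxStyleDecision(int err)
{
    using std::chrono::milliseconds;

    if (err == EAGAIN)
        return {AcceptAction::wait_for_event, milliseconds{0}};
    if (err == ECONNABORTED)
        return {AcceptAction::accept_next, milliseconds{0}};
    if (err == EMFILE || err == ENFILE)
        return {AcceptAction::retry_after_delay, milliseconds{500}};
    return {AcceptAction::log_and_retry, milliseconds{0}};
}
```

A caller would pass the errno value left by a failed accept() and act on the returned action. Note that no branch stops the accept loop.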

I've also studied the Apache httpd server code, in particular, the ap_unixd_accept function.

When the AcceptErrorsNonFatal directive is off (which is the default), httpd treats all accept errors (except for EINTR) as fatal, and gracefully shuts down the child process:

By default, the child process handling a request will gracefully exit when nearly any socket error occurs during the accept() system call. This is to ensure a potentially unhealthy child process does not try to take on more new connections.

When the AcceptErrorsNonFatal directive is on, ECONNREFUSED, ECONNABORTED, and ECONNRESET are not treated as fatal.
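For comparison with NGINX, the httpd behavior described above can be reduced to a single predicate. This is a sketch of my reading of the documentation, not httpd's code. The name httpdStyleIsFatal is hypothetical, and its boolean parameter stands in for the AcceptErrorsNonFatal directive.

```cpp
#include <cerrno>

// Hypothetical helper; acceptErrorsNonFatal mirrors the httpd directive.
inline bool httpdStyleIsFatal(int err, bool acceptErrorsNonFatal)
{
    if (err == EINTR)
        return false; // interrupted by a signal: never treated as fatal

    if (acceptErrorsNonFatal
        && (err == ECONNREFUSED || err == ECONNABORTED || err == ECONNRESET))
    {
        return false; // client-side failures tolerated when the directive is on
    }

    return true; // everything else makes the child process exit gracefully
}
```

With the directive off, even ECONNABORTED ends the child process, which is only reasonable because another child will take its place.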

The Apache MPM worker documentation describes how the server tries to maintain a minimum number of spare worker threads, and the number of running child processes determines the total number of available threads. So if I understand things correctly, a child process that was (gracefully) shut down due to an accept error will eventually be replaced by a new child process if the load stays the same.


The use case of a port already being in use would trigger an error in the bind system call. I forgot about bind in my question because the boost::asio::ip::tcp::acceptor constructor has an overload that takes care of this (and throws exceptions upon errors). I need to refactor my code so that I manually open/bind via the acceptor, so that I can handle and log errors gracefully.
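To illustrate the separate open/bind/listen steps, here is a minimal sketch written directly against POSIX sockets. Asio's acceptor has open, bind, and listen member functions with error_code overloads that follow the same sequence. The helper name openListener is hypothetical, and binding to the loopback address is an assumption made to keep the example self-contained.

```cpp
#include <cerrno>
#include <cstdint>
#include <system_error>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: opens a listening IPv4 TCP socket on the loopback
// interface. On success, returns the descriptor and clears ec. On failure,
// returns -1 and stores the errno of the failing step in ec.
inline int openListener(std::uint16_t port, std::error_code& ec)
{
    ec.clear();

    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
    {
        ec.assign(errno, std::generic_category());
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    // A port that is already in use surfaces here as EADDRINUSE.
    if (::bind(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) != 0
        || ::listen(fd, SOMAXCONN) != 0)
    {
        ec.assign(errno, std::generic_category()); // capture before close()
        ::close(fd);
        return -1;
    }

    return fd;
}
```

The caller can then compare ec against std::errc::address_in_use and log a clear message instead of letting an exception escape from a constructor.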


I found this blog post that describes the same problem I'm facing with accept(). They also classify the error codes in two categories:

  1. Permanent, due to a programming error
  2. Transient (e.g. bad connection)

Sadly, that post has no list of error codes for each of those categories, and the blogger warns about Unix implementations adding extra error codes on top of those specified by POSIX.


To get back to my original question, there are no errors from accept that should cause the server to terminate if I am to emulate the behavior of NGINX or Apache httpd. Since my server will be single-process (but could support thread pools), NGINX would be a better example to emulate.

There are some published error codes for accept() that are due to programming error, so I might deviate from NGINX/httpd's behavior and make my program crash when I don't recognize the error code as being transient.

Here is a table of accept error codes from a few OSes, where I attempt to classify each one according to the documentation.

Code               POSIX  Linux       BSD/iOS  Win32 accept  Win32 WSAAccept**
EAGAIN             Asio   Asio        Asio     Asio          Asio?
EWOULDBLOCK        Asio   Asio        Asio     Asio          Asio?
ECONNABORTED       Asio   Asio        Asio     Asio          Asio?
EINTR              Asio   Asio        Asio     Asio          Asio?
EPROTO             Asio   Asio        Asio     Asio          Asio?
EBADF              Fatal  Fatal       Fatal    Fatal         Fatal
ENOTSOCK           Fatal  Fatal       Fatal    Fatal         Fatal
EOPNOTSUPP         Fatal  Retry*      Fatal    Fatal         Fatal
EINVAL             Fatal  Fatal       Fatal    Fatal         Fatal
EMFILE             Load   Load        Load     Load          Load
ENFILE             Load   Load        Load     --            --
ENOBUFS            Load   Load        Load     Load          Load
ENOMEM             Load   Load        Load     Load          Load
EPERM              --     Fwall       --       --            --
ENOSR              --     Load        --       --            --
ESOCKTNOSUPPORT    --     Fatal       --       --            --
EPROTONOSUPPORT    --     Fatal       --       --            --
ETIMEDOUT          --     Retry?      --       --            --
ENETDOWN           --     Down        --       Down          Down
ENOPROTOOPT        --     Down        --       --            --
EHOSTDOWN          --     Down/Retry  --       --            --
ENONET             --     Down        --       --            --
EHOSTUNREACH       --     Down/Retry  --       --            --
ENETUNREACH        --     Down        --       --            --
EFAULT             --     --          Fatal    Fatal         Fatal
WSANOTINITIALISED  --     --          --       Fatal         Fatal
ECONNRESET         --     --          --       Retry         Retry
EINPROGRESS        --     --          --       n/a           n/a
EACCES             --     --          --       --            Fatal
ECONNREFUSED       --     --          --       --            Retry
WSATRY_AGAIN       --     --          --       --            Retry

Legend:

Label  Meaning
Asio   Consumed by Asio
Down   Transient network down error
Fatal  Programming error
Fwall  Firewall forbids connection (could be fatal or transient)
Load   Transient high load error
Retry  Brief transient error

The Linux man page for accept() has this note on error handling:

Linux accept() (and accept4()) passes already-pending network errors on the new socket as an error code from accept(). This behavior differs from other BSD socket implementations. For reliable operation the application should detect the network errors defined for the protocol after accept() and treat them like EAGAIN by retrying. In the case of TCP/IP, these are ENETDOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH.
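The man page's advice can be sketched for a plain blocking accept() loop as follows. The function names isPendingNetworkError and acceptRetrying are hypothetical. The error list is the TCP/IP one quoted above, and the #ifdef guards are there because EHOSTDOWN and ENONET are not defined on every platform.

```cpp
#include <cerrno>
#include <sys/socket.h>

// Returns true for the errors that the Linux accept(2) man page says to
// treat like EAGAIN when the listening socket uses TCP/IP.
inline bool isPendingNetworkError(int err)
{
    switch (err)
    {
    case ENETDOWN:
    case EPROTO:
    case ENOPROTOOPT:
    case EHOSTUNREACH:
    case EOPNOTSUPP:
    case ENETUNREACH:
#ifdef EHOSTDOWN
    case EHOSTDOWN:
#endif
#ifdef ENONET
    case ENONET:
#endif
        return true;
    default:
        return false;
    }
}

// Hypothetical wrapper for a blocking listening socket: retries accept()
// until it yields a connection or fails with an error outside the list.
inline int acceptRetrying(int listenFd)
{
    for (;;)
    {
        int fd = ::accept(listenFd, nullptr, nullptr);
        if (fd >= 0)
            return fd; // new connection
        if (errno != EINTR && !isPendingNetworkError(errno))
            return -1; // errno is left set for the caller to classify
    }
}
```

This version retries immediately. A production loop would likely add a delay or a retry limit so that a persistent ENETDOWN does not turn into a busy loop.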

(*) The EOPNOTSUPP code in the Linux man page is puzzling. The above paragraph says to treat it like EAGAIN, but in the list below it says "The referenced socket is not of type SOCK_STREAM", which would be a programmer error. Asio's strong typing should make passing the wrong socket type impossible if I don't use the socket constructor that takes a native socket handle.

(**) Asio actually uses AcceptEx, but the docs for that function don't list its error codes. I instead listed the error codes for WSAAccept, should AcceptEx happen to be implemented in terms of WSAAccept.

For the "network down" type of error codes, I think it would be prudent to introduce a delay before reattempting, like what NGINX does for the EMFILE and ENFILE errors.

When it comes to EAGAIN, EWOULDBLOCK, ECONNABORTED, and EINTR, boost::asio::ip::tcp::acceptor::async_accept already handles them internally and retries the accept operation without emitting an error. This behavior for ECONNABORTED can be overridden with the enable_connection_aborted special socket option. async_accept also treats an EPROTO error the same as ECONNABORTED. The function where this happens can be found here.

async_accept emits a boost::asio::error::already_open error here if the socket is already open.


Putting all this information together, I've come up with this function (untested) to classify an error code returned by async_accept:

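#include <cerrno>        // ENOSR, ENONET, EHOSTDOWN on Linux
#include <system_error>  // std::errc
#include <boost/asio.hpp>

// TcpAcceptErrorCategory is assumed to be an enum with the enumerators
// resources, network, retry, and fatal.
TcpAcceptErrorCategory classifyAcceptError(boost::system::error_code ec,
                                           bool treatUnknownErrorsAsFatal)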
{
    namespace sys = boost::system;

    // Check for transient errors due to heavy load.
    if (   ec == std::errc::no_buffer_space
        || ec == std::errc::not_enough_memory
        || ec == std::errc::too_many_files_open
        || ec == std::errc::too_many_files_open_in_system
#if defined(__linux__)
        || ec == sys::error_code{ENOSR, sys::system_category()}
#endif
        )
    {
        return TcpAcceptErrorCategory::resources;
    }

    // Check for network outage errors.
#if defined(__linux__)
    if (   ec == std::errc::network_down
        || ec == std::errc::network_unreachable
        || ec == std::errc::no_protocol_option // "Protocol not available"
        || ec == std::errc::operation_not_permitted // Denied by firewall
        || ec == sys::error_code{ENONET, sys::system_category()})
    {
        return TcpAcceptErrorCategory::network;
    }
#elif defined(_WIN32) || defined(__CYGWIN__)
    if (ec == std::errc::network_down)
        return TcpAcceptErrorCategory::network;
#endif

    // Check for other transient errors. Asio already takes care of
    // EAGAIN, EWOULDBLOCK, ECONNABORTED, EPROTO, and EINTR.
#if defined(__linux__)
    if (   ec == std::errc::host_unreachable
        || ec == std::errc::operation_not_supported
        || ec == std::errc::timed_out
        || ec == sys::error_code{EHOSTDOWN, sys::system_category()})
    {
        return TcpAcceptErrorCategory::retry;
    }
#elif defined(_WIN32) || defined(__CYGWIN__)
    if (   ec == std::errc::connection_refused
        || ec == std::errc::connection_reset
        || ec == sys::error_code{WSATRY_AGAIN, sys::system_category()})
    {
        return TcpAcceptErrorCategory::retry;
    }
#endif

    if (treatUnknownErrorsAsFatal)
        return TcpAcceptErrorCategory::fatal;

    // Check for programming errors
    if (   ec == boost::asio::error::already_open
        || ec == std::errc::bad_file_descriptor
        || ec == std::errc::not_a_socket
        || ec == std::errc::invalid_argument
#if !defined(__linux__)
        || ec == std::errc::operation_not_supported
#endif
#if defined(BSD) || defined(__APPLE__)
        || ec == std::errc::bad_address // EFAULT
#elif defined(_WIN32) || defined(__CYGWIN__)
        || ec == std::errc::bad_address // EFAULT
        || ec == std::errc::permission_denied
        || ec == sys::error_code{WSANOTINITIALISED, sys::system_category()}
#endif
        )
    {
        return TcpAcceptErrorCategory::fatal;
    }

    return TcpAcceptErrorCategory::retry;
}
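The function above returns a TcpAcceptErrorCategory whose definition isn't shown. Below is a minimal sketch that assumes it is a plain enum with the four enumerators used above. It adds a hypothetical retryDelay helper that turns a category into a wait time before the next accept. The 500 ms value for resource errors is borrowed from NGINX; the network value is a placeholder.

```cpp
#include <chrono>
#include <optional>

// Assumed definition: the enumerators are the ones classifyAcceptError uses.
enum class TcpAcceptErrorCategory { resources, network, retry, fatal };

// Hypothetical policy: how long to wait before accepting again, or
// std::nullopt when the accept loop should stop.
inline std::optional<std::chrono::milliseconds>
retryDelay(TcpAcceptErrorCategory category)
{
    using std::chrono::milliseconds;

    switch (category)
    {
    case TcpAcceptErrorCategory::resources:
        return milliseconds{500}; // same pause NGINX uses for EMFILE/ENFILE
    case TcpAcceptErrorCategory::network:
        return milliseconds{500}; // placeholder, tune for the deployment
    case TcpAcceptErrorCategory::retry:
        return milliseconds{0};   // accept again immediately
    case TcpAcceptErrorCategory::fatal:
        break;                    // fall through to "stop the loop"
    }
    return std::nullopt;
}
```

In the accept handler, a non-empty result could arm a timer (for example a boost::asio::steady_timer) whose completion handler calls do_accept() again, while an empty result would break out of the listening loop.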

I hope these findings help others faced with a similar predicament.

Emile Cormier answered Jun 29 '26 07:06