
sendmsg fails with error code 3 (ESRCH)

OS: Linux 2.6.24 (x86)

My application runs on a server where several clients connect to it on UDP port 4500.
Intermittently, the application fails to send UDP traffic to clients on UDP port 4500.

This is because the sendmsg system call fails with error code 3 (ESRCH).
The man page for sendmsg doesn't mention ESRCH.
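For context, error code 3 is ESRCH on Linux. Kernel code paths report failure by returning a negative errno value, and the C library's system-call wrapper turns a raw return value in the range -4095 to -1 into a return of -1 with errno set to the positive code. The send(2) man page also notes that additional errors may be generated and returned by the underlying protocol modules, so a code missing from the sendmsg error list can still surface. The sketch below models only that wrapper convention; `model_syscall_return` is a hypothetical helper, not a libc function.

```c
#include <errno.h>

/* Hypothetical model of how a libc wrapper translates a raw Linux
 * system-call return value: values in [-4095, -1] are negated errno
 * codes, anything else is a successful result. */
static long model_syscall_return(long kernel_ret)
{
    if (kernel_ret < 0 && kernel_ret >= -4095) {
        errno = (int)-kernel_ret;   /* e.g. -ESRCH becomes errno == ESRCH (3) */
        return -1;
    }
    return kernel_ret;              /* success: pass the value through */
}
```

So a kernel path that returns -ESRCH from deep inside the network stack reaches the application as sendmsg() returning -1 with errno == 3.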

The problem isn't resolved even after killing the application and relaunching it.
UDP traffic on other ports works fine.

Rebooting the server is the only solution.

With kernel 2.6.11, I haven't seen issues like this.

Any idea how to debug this issue?
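A reasonable first step is to capture errno at the exact call that fails and log it before anything else can overwrite it. The sketch below assumes the application sends through a plain datagram socket; `send_datagram` and `log_send_failure` are hypothetical helper names, not part of the poster's code.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one datagram with sendmsg(2). Returns the byte count on success or
 * -errno on failure; errno is read immediately, before any other call
 * (such as fprintf) can overwrite it. */
static ssize_t send_datagram(int fd, const struct sockaddr *dst, socklen_t dstlen,
                             const void *buf, size_t len)
{
    struct iovec iov;
    struct msghdr msg;
    ssize_t n;

    iov.iov_base = (void *)buf;      /* sendmsg only reads the buffer */
    iov.iov_len = len;

    memset(&msg, 0, sizeof msg);
    msg.msg_name = (void *)dst;      /* may be NULL for a connected socket */
    msg.msg_namelen = dstlen;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = sendmsg(fd, &msg, 0);
    if (n < 0)
        return -(ssize_t)errno;
    return n;
}

/* Log a failure returned by send_datagram(). ESRCH is called out because
 * it is not in the documented sendmsg error list. */
static void log_send_failure(ssize_t rc, const char *peer)
{
    int err = (int)-rc;

    fprintf(stderr, "sendmsg to %s failed: %s (errno %d)\n",
            peer, strerror(err), err);
    if (err == ESRCH)
        fprintf(stderr, "  ESRCH is undocumented for sendmsg; "
                        "note the time and compare system state before/after\n");
}
```

In the send path, call `log_send_failure(rc, peer_name)` whenever `send_datagram` returns a negative value, so each failure is recorded with the peer it was destined for.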

user345794 asked May 11 '12 13:05





1 Answer

Given the information available, I think the best place to start debugging is to check whether there is any path through which sendmsg can return ESRCH. First we need the source for the particular kernel version you have seen the issue on; I found it here.

After some digging, we can see that the following call chain may execute:

net/ipv4/udp.c:646:     
    err = ip_route_output_flow(&rt, &fl, sk, 1);

net/ipv4/route.c:2421:
    err = __xfrm_lookup((struct dst_entry **)rp, flp, sk, flags);

net/xfrm/xfrm_policy.c:1380:
    pols[1] = xfrm_policy_lookup_bytype(XFRM_POLICY_TYPE_MAIN,
                            fl, family,
                            XFRM_POLICY_OUT);

net/xfrm/xfrm_policy.c:890:
    err = xfrm_policy_match(pol, fl, type, family, dir);

Finally, we end up at net/xfrm/xfrm_policy.c:854:xfrm_policy_match

/*
 * Find policy to apply to this flow.
 *
 * Returns 0 if policy found, else an -errno.
 */
static int xfrm_policy_match(struct xfrm_policy *pol, struct flowi *fl,
             u8 type, u16 family, int dir)
{
    struct xfrm_selector *sel = &pol->selector;
    int match, ret = -ESRCH;

    if (pol->family != family ||
        pol->type != type)
        return ret;

    match = xfrm_selector_match(sel, fl, family);
    if (match)
        ret = security_xfrm_policy_lookup(pol, fl->secid, dir);

    return ret;
}
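xfrm_policy_match only returns something other than -ESRCH when the policy's family and type agree and xfrm_selector_match accepts the flow. In kernels of this era the IPv4 selector test roughly compares source and destination address prefixes, masked ports, the protocol, and the interface. The sketch below is a simplified userspace model of the destination address, port, and protocol part of that test; the struct names `model_selector` and `model_flow` are hypothetical, everything is in host byte order, and this is not the kernel's code.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-ins for struct xfrm_selector and struct flowi.
 * All fields are in host byte order for clarity. */
struct model_selector {
    uint32_t daddr;        /* destination address */
    uint8_t  prefixlen_d;  /* leading address bits that must match (0..32) */
    uint16_t dport;        /* destination port */
    uint16_t dport_mask;   /* 0xffff = exact port, 0 = any port */
    uint8_t  proto;        /* IP protocol number, 0 = any */
};

struct model_flow {
    uint32_t daddr;
    uint16_t dport;
    uint8_t  proto;
};

/* True if the first prefixlen bits of a and b are equal. */
static int prefix_match(uint32_t a, uint32_t b, uint8_t prefixlen)
{
    uint32_t mask = prefixlen >= 32 ? 0xffffffffu : ~(0xffffffffu >> prefixlen);

    return ((a ^ b) & mask) == 0;
}

/* A flow matches only if the address prefix, the masked port and the
 * protocol all agree with the selector. */
static int model_selector_match(const struct model_selector *sel,
                                const struct model_flow *fl)
{
    return prefix_match(fl->daddr, sel->daddr, sel->prefixlen_d) &&
           ((fl->dport ^ sel->dport) & sel->dport_mask) == 0 &&
           (fl->proto == sel->proto || sel->proto == 0);
}
```

With example values, a selector for 192.168.1.0/24, UDP port 4500 matches a flow to 192.168.1.7:4500 but not one to 192.168.1.7:5000 or 192.168.2.7:4500; setting `dport_mask` to 0 makes it match any port.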

So it looks like the error is coming from xfrm_policy_match. If you inspect the code in xfrm_policy_lookup_bytype, you will find a loop that continues until the iterator is exhausted or xfrm_policy_match returns something other than -ESRCH.
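The loop's control flow can be modelled in a few lines. In this simplified sketch, `model_policy` and `model_lookup` are hypothetical names, and each policy simply carries the value xfrm_policy_match would have returned for the flow; the real function also deals with details such as policy priority, which the model ignores.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified policy: match_result stands in for what
 * xfrm_policy_match() would return for a given flow
 * (0, -ESRCH, or another negative errno). */
struct model_policy {
    const char *name;
    int match_result;
};

/* -ESRCH means "not this policy, try the next one"; 0 selects the policy;
 * any other error stops the search. On return, *err holds 0 or the
 * error that stopped the loop. */
static const struct model_policy *
model_lookup(const struct model_policy *pols, size_t n, int *err)
{
    size_t i;

    *err = 0;
    for (i = 0; i < n; i++) {
        int rc = pols[i].match_result;

        if (rc == -ESRCH)
            continue;          /* no match: keep iterating */
        if (rc != 0) {
            *err = rc;         /* hard error: abort the lookup */
            return NULL;
        }
        return &pols[i];       /* first matching policy is returned */
    }
    return NULL;               /* list exhausted: no policy for this flow */
}
```

In this model, a list where every entry yields -ESRCH returns NULL with `*err == 0`, a list whose second entry yields 0 returns that entry, and an entry yielding a different error such as -EACCES stops the search with that error.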

This suggests that your sendmsg calls are failing because there is no matching xfrm policy for your port. Since you say that it works at first, and then the error appears and persists, the xfrm policies on your system are probably being modified or corrupted.

The xfrm man page here describes some tools for investigating the policies. Based on it, my next step would be to run ip xfrm policy list and ip xfrm state list both before and after the issue occurs, and compare the output. Unfortunately, I don't have a running system with a 2.6.24 kernel to dig any deeper.
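To make the before/after comparison repeatable, the output of each command can be saved to a file while the system is healthy and again once sendmsg starts failing. The sketch below does this with popen(3); `snapshot_command`, `files_identical`, and the file names in the usage note are hypothetical, and the ip xfrm commands typically need root (CAP_NET_ADMIN) to show their full output.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Run a shell command and copy its standard output into out_path.
 * Returns the status from pclose() (0 if the command exited with 0),
 * or -1 on an I/O failure. */
static int snapshot_command(const char *cmd, const char *out_path)
{
    char buf[4096];
    size_t n;
    FILE *in, *out;
    int status;

    in = popen(cmd, "r");
    if (in == NULL)
        return -1;
    out = fopen(out_path, "w");
    if (out == NULL) {
        pclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            fclose(out);
            pclose(in);
            return -1;
        }
    }
    status = pclose(in);
    if (fclose(out) != 0)
        return -1;
    return status;
}

/* Returns 1 if both files have identical contents, 0 if not, -1 on error. */
static int files_identical(const char *a, const char *b)
{
    FILE *fa = fopen(a, "r");
    FILE *fb = fopen(b, "r");
    int ca, cb, result = -1;

    if (fa != NULL && fb != NULL) {
        do {                      /* compare byte by byte until a difference or EOF */
            ca = getc(fa);
            cb = getc(fb);
        } while (ca == cb && ca != EOF);
        result = (ca == cb);
    }
    if (fa != NULL)
        fclose(fa);
    if (fb != NULL)
        fclose(fb);
    return result;
}
```

For example, call `snapshot_command("ip xfrm policy list", "xfrm_policy_before.txt")` while sends succeed, repeat with `xfrm_policy_after.txt` once ESRCH appears (likewise for `ip xfrm state list`), and use `files_identical` or `diff` to see what changed.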

Note that there was no trick to reaching this conclusion; it came from reading the code, grepping, and following the calls. This can take a lot of time and effort, especially in an unfamiliar code base. To fix the issue, as opposed to debugging it, I would have tried different kernel versions before digging this deep.


TL;DR

It looks like the ESRCH error comes from a networking subsystem called xfrm. The tools for investigating xfrm are described on its man page here.

It seems most likely that the error is due to a missing policy for the address/port you are trying to send to. This may be due to a configuration change while the system is running or a bug causing corruption of the xfrm policies.

Michael Shaw answered Oct 11 '22 15:10
