This question is similar to "Network port open, but no process attached?" and "netstat shows a listening port with no pid but lsof does not", but the answers there don't solve my problem, which is stranger still.
I have a server application called lps that waits for TCP connections on port 8588.
[root@centos63 lcms]# netstat -lnp | grep 8588
tcp 0 0 0.0.0.0:8588 0.0.0.0:* LISTEN 6971/lps
As you can see, nothing is wrong with the listening socket. But when I connect a few thousand test clients (written by a colleague) to the server, whether 2000, 3000, or 4000, there are always 5 clients (a random 5 each time) that connect and send a login request to the server but never receive any response. Take 3000 clients as an example. This is what the netstat command gives:
[root@centos63 lcms]# netstat -nap | grep 8588 | grep ES | wc -l
3000
And this is the lsof command output:
[root@centos63 lcms]# lsof -i:8588 | grep ES | wc -l
2995
Those 5 connections are here:
[root@centos63 lcms]# netstat -nap | grep 8588 | grep -v 'lps'
tcp 92660 0 192.168.0.235:8588 192.168.0.241:52658 ESTABLISHED -
tcp 92660 0 192.168.0.235:8588 192.168.0.241:52692 ESTABLISHED -
tcp 92660 0 192.168.0.235:8588 192.168.0.241:52719 ESTABLISHED -
tcp 92660 0 192.168.0.235:8588 192.168.0.241:52721 ESTABLISHED -
tcp 92660 0 192.168.0.235:8588 192.168.0.241:52705 ESTABLISHED -
The 5 lines above show connections to the server on port 8588 with no program attached. The second column (Recv-Q) keeps increasing as the clients send requests.
The links above mention NFS mounts and RPC. As for RPC, I ran rpcinfo -p and the result has nothing to do with port 8588. As for NFS mounts, the nfsstat output says Error: No Client Stats (/proc/net/rpc/nfs: No such file or directory).
Question: How can this happen? It is always 5, and never the same 5 clients. I don't think it's a port conflict, since the other clients connect to the same server IP and port and are all handled properly by the server.
Note: I'm using Linux epoll to accept client requests. I also added debug code to my program and recorded every socket (along with the client's information) that accept returns, but I cannot find the 5 connections among them. This is the uname -a output:
Linux centos63 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Thanks for your kind help! I'm really confused.
Update 2013-06-08:
After upgrading the system to CentOS 6.4, the same problem occurs. Finally I went back to epoll and found this page, which says to set the listening fd to non-blocking and call accept until an EAGAIN or EWOULDBLOCK error is returned. And yes, it works: no more connections are left pending. But why is that? Unix Network Programming, Volume 1 says:
accept is called by a TCP server to return the next completed connection from the
front of the completed connection queue. If the completed connection queue is empty,
the process is put to sleep (assuming the default of a blocking socket).
So if there are still completed connections in the queue, why is the process put to sleep?
Update 2013-07-01:
I use EPOLLET when adding the listening socket, so I can't accept all pending connections unless I keep calling accept until EAGAIN is encountered. I only just realized this. My fault. Remember: when using EPOLLET, always read or accept until EAGAIN comes back, even on a listening socket. Thanks again to Matthew for providing me with a testing program.
I've tried duplicating your problem, and I cannot. Here is my server source code.
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <err.h>
#include <sysexits.h>
#include <string.h>
#include <unistd.h>
struct {
int numfds;
int numevents;
struct epoll_event *events;
} connections = { 0, 0, NULL };
static int create_srv_socket(const char *port) {
int fd = -1;
int rc;
struct addrinfo *ai = NULL, hints;
memset(&hints, 0, sizeof(hints));
hints.ai_flags = AI_PASSIVE;
hints.ai_socktype = SOCK_STREAM; /* restrict getaddrinfo results to TCP */
if ((rc = getaddrinfo(NULL, port, &hints, &ai)) != 0)
errx(EX_UNAVAILABLE, "Cannot create socket: %s", gai_strerror(rc));
if ((fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol)) < 0)
err(EX_OSERR, "Cannot create socket");
rc = 1;
if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &rc, sizeof(rc)) < 0) /* set before bind */
err(EX_OSERR, "Cannot setup socket options");
if (bind(fd, ai->ai_addr, ai->ai_addrlen) < 0)
err(EX_OSERR, "Cannot bind to socket");
if (listen(fd, 25) < 0)
err(EX_OSERR, "Cannot setup listen length on socket");
freeaddrinfo(ai);
return fd;
}
static int create_epoll(void) {
int fd;
if ((fd = epoll_create1(0)) < 0)
err(EX_OSERR, "Cannot create epoll");
return fd;
}
static bool epoll_join(int epollfd, int fd, int events) {
struct epoll_event ev;
ev.events = events;
ev.data.fd = fd;
if ((connections.numfds+1) >= connections.numevents) {
connections.numevents+=1024;
connections.events = realloc(connections.events,
sizeof(*connections.events)*connections.numevents); /* sizeof the element, not the pointer */
if (!connections.events)
err(EX_OSERR, "Cannot allocate memory for events list");
}
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
warn("Cannot add socket to epoll set");
return false;
}
connections.numfds++;
return true;
}
static void epoll_leave(int epollfd, int fd) {
if (epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL) < 0)
err(EX_OSERR, "Could not remove entry from epoll set");
connections.numfds--;
}
static void cleanup_old_events(void) {
if ((connections.numevents - 1024) > connections.numfds) {
connections.numevents -= 1024;
connections.events = realloc(connections.events,
sizeof(*connections.events)*connections.numevents); /* sizeof the element, not the pointer */
}
}
static void disconnect(int fd) {
shutdown(fd, SHUT_RDWR);
close(fd);
return;
}
static bool read_and_reply(int fd) {
char buf[128];
int rc;
memset(buf, 0, sizeof(buf));
if ((rc = recv(fd, buf, sizeof(buf), 0)) <= 0) {
if (rc < 0) /* rc == 0 is an orderly close, not an error */
warn("Cannot read from socket");
return false;
}
if (send(fd, buf, rc, MSG_NOSIGNAL) < 0) {
warn("Cannot send to socket");
return false;
}
return true;
}
int main()
{
int srv = create_srv_socket("8558");
int ep = create_epoll();
int rc = -1;
struct epoll_event *ev = NULL;
if (!epoll_join(ep, srv, EPOLLIN))
err(EX_OSERR, "Server cannot join epollfd");
while (1) {
int i, cli;
rc = epoll_wait(ep, connections.events, connections.numfds, -1);
if (rc < 0 && errno == EINTR)
continue;
else if (rc < 0)
err(EX_OSERR, "Cannot properly perform epoll wait");
for (i=0; i < rc; i++) {
ev = &connections.events[i];
if (ev->data.fd != srv) {
if (ev->events & EPOLLIN) {
if (!read_and_reply(ev->data.fd)) {
epoll_leave(ep, ev->data.fd);
disconnect(ev->data.fd);
}
}
if (ev->events & EPOLLERR || ev->events & EPOLLHUP) {
if (ev->events & EPOLLERR)
warn("Error in in fd: %d", ev->data.fd);
else
warn("Closing disconnected fd: %d", ev->data.fd);
epoll_leave(ep, ev->data.fd);
disconnect(ev->data.fd);
}
}
else {
if (ev->events & EPOLLIN) {
if ((cli = accept(srv, NULL, 0)) < 0) {
warn("Could not add socket");
continue;
}
epoll_join(ep, cli, EPOLLIN);
}
if (ev->events & EPOLLERR || ev->events & EPOLLHUP)
err(EX_OSERR, "Server FD %d has failed", ev->data.fd);
}
}
cleanup_old_events();
}
}
Here is the client:
from socket import *
import time

scks = list()
for i in range(0, 3000):
    s = socket(AF_INET, SOCK_STREAM)
    s.connect(("localhost", 8558))
    scks.append(s)
time.sleep(600)
When running this on my local machine I get 6001 sockets using port 8558 (1 listening, 3000 client side sockets and 3000 server side sockets).
$ ss -ant | grep 8558 | wc -l
6001
When checking the number of IP connections connected on the client I get 3000.
# lsof -p$(pgrep python) | grep IPv4 | wc -l
3000
I've also tried the test with the server on a remote machine, also with success. I'd suggest you attempt the same. In addition, try turning off iptables completely in case it's a connection-tracking quirk. Sometimes the netfilter options in /proc can help too, so try sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1.
Edit: I've done another test which produces the output you see on your side. Your problem is that you are shutting down the connection on the server side prematurely. I can duplicate results similar to what you are seeing by doing the following:

- shutdown(fd, SHUT_RD) on the socket, then
- send(fd, buf, sizeof(buf)) on the server.

After doing this, the connections missing from lsof -i:8558 | grep ES are in CLOSE_WAIT. This only happens on a half-shutdown connection.
As such I suspect this is a bug in your client or server program. Either you are sending something to the server which the server objects to, or the server is invalidly closing connections down for some reason.
You need to confirm what state the "anomalous" connections are in (CLOSE_WAIT or something else).
At this stage I also consider this a programming problem rather than something that belongs on Server Fault. Without seeing the relevant portions of the client/server source, nobody will be able to track down the cause of the fault. That said, I am pretty confident it is nothing to do with the way the operating system handles the connections.