I have an application that receives, processes, and transmits UDP packets.
Everything works fine if the port numbers for reception and transmission are different.
If the port numbers are the same and the IP addresses are different, it usually works fine EXCEPT when the destination IP address is on the same subnet as the machine running the application. In that last case the send_to function requires several seconds to complete, instead of the usual few milliseconds.
Rx Port   Tx IP              Tx Port   Result
5001      Same subnet        5002      OK, delay ~ 0.001 secs
5001      Different subnet   5001      OK, delay ~ 0.001 secs
5001      Same subnet        5001      Fails, delay > 2 secs
Here is a short program that demonstrates the problem.
#include <ctime>
#include <iostream>
#include <string>
#include <boost/array.hpp>
#include <boost/asio.hpp>
#include <windows.h> // QueryPerformanceCounter, QueryPerformanceFrequency
using boost::asio::ip::udp;
using std::cout;
using std::endl;
int test( const std::string& output_IP)
{
try
{
unsigned short prev_seq_no;
boost::asio::io_service io_service;
// build the input socket
/* This is connected to a UDP client that is running continuously
sending messages that include an incrementing sequence number
*/
const int input_port = 5001;
udp::socket input_socket(io_service, udp::endpoint(udp::v4(), input_port ));
// build the output socket
const std::string output_Port = "5001";
udp::resolver resolver(io_service);
udp::resolver::query query(udp::v4(), output_IP, output_Port );
udp::endpoint output_endpoint = *resolver.resolve(query);
udp::socket output_socket( io_service );
output_socket.open(udp::v4());
// double output buffer size
boost::asio::socket_base::send_buffer_size option( 8192 * 2 );
output_socket.set_option(option);
cout << "TX to " << output_endpoint.address() << ":" << output_endpoint.port() << endl;
int count = 0;
for (;;)
{
// receive packet
unsigned short recv_buf[ 20000 ];
udp::endpoint remote_endpoint;
boost::system::error_code error;
int bytes_received = input_socket.receive_from(boost::asio::buffer(recv_buf,20000),
remote_endpoint, 0, error);
if (error && error != boost::asio::error::message_size)
throw boost::system::system_error(error);
// start timer
__int64 TimeStart;
QueryPerformanceCounter( (LARGE_INTEGER *)&TimeStart );
// send onwards
boost::system::error_code ignored_error;
output_socket.send_to(
boost::asio::buffer(recv_buf,bytes_received),
output_endpoint, 0, ignored_error);
// stop time and display tx time
__int64 TimeEnd;
QueryPerformanceCounter( (LARGE_INTEGER *)&TimeEnd );
__int64 f;
QueryPerformanceFrequency( (LARGE_INTEGER *)&f );
cout << "Send time secs " << (double) ( TimeEnd - TimeStart ) / (double) f << endl;
// stop after loops
if( count++ > 10 )
break;
}
}
catch (std::exception& e)
{
std::cerr << e.what() << std::endl;
return 1; // report failure to the caller
}
return 0; // test() is declared to return int
}
int main( )
{
test( "193.168.1.200" );
test( "192.168.1.200" );
return 0;
}
The output from this program, when running on a machine with address 192.168.1.101:
TX to 193.168.1.200:5001
Send time secs 0.0232749
Send time secs 0.00541566
Send time secs 0.00924535
Send time secs 0.00449014
Send time secs 0.00616714
Send time secs 0.0199299
Send time secs 0.00746081
Send time secs 0.000157972
Send time secs 0.000246775
Send time secs 0.00775578
Send time secs 0.00477618
Send time secs 0.0187321
TX to 192.168.1.200:5001
Send time secs 1.39485
Send time secs 3.00026
Send time secs 3.00104
Send time secs 0.00025927
Send time secs 3.00163
Send time secs 2.99895
Send time secs 6.64908e-005
Send time secs 2.99864
Send time secs 2.98798
Send time secs 3.00001
Send time secs 3.00124
Send time secs 9.86207e-005
Why is this happening? Is there any way I can reduce the delay?
Notes:
Built using code::blocks, running under various flavours of Windows
Packets are 10000 bytes long
The problem goes away if I connect the computer running the application to a second network. For example a WWLAN ( cellular network "rocket stick" )
As far as I can tell, this is the situation we have:
Different ports, same LAN: this works.
Same ports, different LANs: this also works.
Same ports, same LAN: this does NOT work.
Same ports, same LAN, dual-homed Module2 host: this seems to work.
Given this is being observed on Windows for large datagrams with a destination address of a non-existent peer within the same subnet as the sender, the problem is likely the result of send() blocking while waiting for an Address Resolution Protocol (ARP) response so that the layer 2 Ethernet frame can be populated:
When sending data, the layer 2 Ethernet frame will be populated with the media access control (MAC) address of the next hop in the route. If the sender does not know the MAC address for the next hop, it will broadcast an ARP request and cache responses. Using the sender's subnet mask and the destination address, the sender can determine if the next hop is on the same subnet as the sender or if the data must route through the default gateway. The results in the question for large datagrams can be interpreted in this light.
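To make the subnet check concrete, here is a minimal sketch of the comparison a sender performs before deciding whether to ARP for the peer or for the default gateway. The helper names (parse_ipv4, same_subnet) are made up for illustration, and the 255.255.255.0 mask used in the note below is an assumption, since the question does not state the mask in use.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: parses dotted-quad "a.b.c.d" into a host-order 32-bit value.
// Returns false if the text is not four dot-separated numbers in 0..255.
bool parse_ipv4(const std::string& text, std::uint32_t& out)
{
    std::istringstream in(text);
    unsigned a = 0, b = 0, c = 0, d = 0;
    char dot1 = 0, dot2 = 0, dot3 = 0;
    if (!(in >> a >> dot1 >> b >> dot2 >> c >> dot3 >> d))
        return false;
    if (dot1 != '.' || dot2 != '.' || dot3 != '.')
        return false;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return false;
    out = (a << 24) | (b << 16) | (c << 8) | d;
    return true;
}

// Hypothetical helper: true when both addresses share the same network prefix
// under the given mask. Returns false if any argument fails to parse.
bool same_subnet(const std::string& local, const std::string& dest,
                 const std::string& mask)
{
    std::uint32_t l = 0, d = 0, m = 0;
    if (!parse_ipv4(local, l) || !parse_ipv4(dest, d) || !parse_ipv4(mask, m))
        return false;
    return (l & m) == (d & m);
}
```

With the addresses from the question and the assumed mask, same_subnet("192.168.1.101", "192.168.1.200", "255.255.255.0") is true, so the sender has to resolve the peer's own MAC address, whereas 193.168.1.200 is off-subnet and would be sent via the default gateway.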
The socket's send buffer size (SO_SNDBUF) is being set to 16384 bytes, but the datagrams being sent are 10000 bytes each. The behavior of send() when the buffer is saturated is unspecified, but some systems will observe send() blocking. In this case, saturation would occur fairly quickly if any datagrams incur a delay, such as by waiting for an ARP response.
// Datagrams being sent are 10000 bytes, but the socket buffer is 16384.
boost::asio::socket_base::send_buffer_size option(8192 * 2);
output_socket.set_option(option);
Consider letting the kernel manage the socket buffer size or increasing it based on your expected throughput.
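As a rough back-of-the-envelope illustration of why 16384 bytes saturates so quickly, the sketch below counts how many whole datagrams fit in a buffer and estimates a size that could absorb a stall. Both function names are hypothetical, the arithmetic ignores any per-datagram overhead the OS may add, and the rate of 10 datagrams per second in the note is an assumed figure, not something taken from the question.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: whole datagrams that fit in a send buffer,
// ignoring any bookkeeping overhead the OS may charge per datagram.
std::size_t datagrams_that_fit(std::size_t buffer_bytes, std::size_t datagram_bytes)
{
    return datagram_bytes == 0 ? 0 : buffer_bytes / datagram_bytes;
}

// Hypothetical helper: bytes needed to queue everything produced during a stall
// (for example, while an ARP request is outstanding).
std::size_t buffer_for_stall(std::size_t datagram_bytes,
                             double datagrams_per_sec,
                             double stall_secs)
{
    const double queued = std::ceil(datagrams_per_sec * stall_secs);
    return static_cast<std::size_t>(queued) * datagram_bytes;
}
```

Here datagrams_that_fit(16384, 10000) is 1, so a second datagram arriving during a stall already has nowhere to go. With an assumed 10 datagrams per second and the roughly 3 second stalls seen in the question's output, buffer_for_stall(10000, 10.0, 3.0) gives 300000 bytes, which could then be passed to boost::asio::socket_base::send_buffer_size in place of 8192 * 2.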
When sending a datagram with a size that exceeds the Windows registry FastSendDatagramThreshold parameter, the send() call can block until the datagram has been sent. For more details, see the Microsoft TCP/IP Implementation Details:
Datagrams smaller than the value of this parameter go through the fast I/O path or are buffered on send. Larger ones are held until the datagram is actually sent. The default value was found by testing to be the best overall value for performance. Fast I/O means copying data and bypassing the I/O subsystem, instead of mapping memory and going through the I/O subsystem. This is advantageous for small amounts of data. Changing this value is not generally recommended.
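For checking which path a given datagram size is likely to take, a sketch along these lines could read the threshold and compare against it. The registry location (HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters) and the 1024-byte fallback are assumptions based on my understanding of the AFD parameters and are not stated in the quoted text; the function names are hypothetical. On non-Windows builds the lookup simply returns the fallback.

```cpp
#ifdef _WIN32
#include <windows.h> // RegOpenKeyExA, RegQueryValueExA (link with advapi32)
#endif
#include <cstddef>

// Assumed default used when the registry value is absent.
const unsigned long kAssumedDefaultThreshold = 1024;

// Hypothetical helper: returns the configured threshold, or the assumed default.
unsigned long fast_send_datagram_threshold()
{
#ifdef _WIN32
    HKEY key = NULL;
    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                      "SYSTEM\\CurrentControlSet\\Services\\AFD\\Parameters",
                      0, KEY_QUERY_VALUE, &key) == ERROR_SUCCESS)
    {
        DWORD value = 0, type = 0, size = sizeof(value);
        const LONG rc = RegQueryValueExA(key, "FastSendDatagramThreshold", NULL,
                                         &type, reinterpret_cast<LPBYTE>(&value),
                                         &size);
        RegCloseKey(key);
        if (rc == ERROR_SUCCESS && type == REG_DWORD)
            return value;
    }
#endif
    return kAssumedDefaultThreshold;
}

// Hypothetical helper: true when the datagram is larger than the threshold and
// may therefore be held until actually sent. The quoted text does not say which
// path a datagram exactly equal to the threshold takes.
bool may_block_until_sent(std::size_t datagram_bytes, unsigned long threshold)
{
    return datagram_bytes > threshold;
}
```

Under the assumed default, the 10000-byte packets from the question are well above the threshold, so may_block_until_sent(10000, 1024) is true.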
If one is observing delays for each send() to an existing peer on the sender's subnet, then profile and analyze the network.
Also note that sending datagrams below the FastSendDatagramThreshold value in quick succession while waiting for ARP to resolve may cause datagrams to be discarded:
ARP queues only one outbound IP datagram for a specified destination address while that IP address is being resolved to a media access control address. If a User Datagram Protocol (UDP)-based application sends multiple IP datagrams to a single destination address without any pauses between them, some of the datagrams may be dropped if there is no ARP cache entry already present. An application can compensate for this by calling the iphlpapi.dll routine SendArp() to establish an ARP cache entry, before sending the stream of packets.
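A minimal sketch of priming the ARP cache this way might look as follows. The wrapper name prime_arp_cache is hypothetical; the underlying call is SendARP from the IP Helper API (link with iphlpapi and ws2_32). The non-Windows branch is only a stub so the snippet compiles elsewhere. Newer MSVC toolchains flag inet_addr as deprecated, so it may need replacing there.

```cpp
#ifdef _WIN32
#include <winsock2.h>
#include <iphlpapi.h> // SendARP
#endif

// Hypothetical wrapper: asks the system to resolve the peer's MAC address.
// Returns true if the peer answered, in which case the ARP cache should now
// hold an entry for it. Returns false for malformed input or no reply.
bool prime_arp_cache(const char* dotted_ipv4)
{
#ifdef _WIN32
    const unsigned long dest = inet_addr(dotted_ipv4);
    if (dest == INADDR_NONE)
        return false;
    ULONG mac[2] = { 0, 0 }; // receives the 6-byte hardware address
    ULONG mac_len = 6;
    // Source address 0 lets the system pick the outgoing interface.
    return SendARP(dest, 0, mac, &mac_len) == NO_ERROR;
#else
    (void)dotted_ipv4; // stub: no IP Helper API on this platform
    return false;
#endif
}
```

In the question's program this could be called once with "192.168.1.200" before entering the receive loop. Since the lookup can itself take a while when the peer does not exist, it is probably best kept off the time-critical path, and a false result is a hint that sends to that address are likely to stall.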