Really weird HTTP client using TcpClient in C#

I am implementing a simple HTTP client that just connects to a web server and fetches its default homepage. Here it is, and it works fine:

using System;
using System.Net.Sockets;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            TcpClient tc = new TcpClient();
            tc.Connect("www.google.com", 80);

            using (NetworkStream ns = tc.GetStream())
            {
                System.IO.StreamWriter sw = new System.IO.StreamWriter(ns);
                System.IO.StreamReader sr = new System.IO.StreamReader(ns);

                string req = "";
                req += "GET / HTTP/1.0\r\n";
                req += "Host: www.google.com\r\n";
                req += "\r\n";

                sw.Write(req);
                sw.Flush();

                Console.WriteLine("[reading...]");
                Console.WriteLine(sr.ReadToEnd());
            }
            tc.Close();
            Console.WriteLine("[done!]");
            Console.ReadKey();
        }
    }
}

When I delete the following line from the code above, the program blocks on sr.ReadToEnd.

req += "Host: www.google.com\r\n";

I even replaced sr.ReadToEnd with sr.Read, but it cannot read anything. I used Wireshark to see what happens:

Screenshot of captured packets using Wireshark http://www.imagechicken.com/uploads/1252514718052893500.jpg

As you can see, after my GET request Google doesn't respond and the request is retransmitted again and again. It seems we HAVE TO specify the Host part in the HTTP request. The weird part is that WE DON'T. I used telnet to send this request and got a response from Google. I also captured the request sent by telnet and it was exactly the same as my request.

I tried many other websites (e.g. Yahoo, Microsoft), but the result is the same.

So, does the delay in telnet cause the web server to act differently (because in telnet we actually type characters instead of sending them together in one packet)?


Another weird problem is that when I change HTTP/1.0 to HTTP/1.1, the program always blocks on the sr.ReadToEnd line. I guess that's because the web server doesn't close the connection.

One solution is to use Read (or ReadLine) and ns.DataAvailable to read the response. But then I cannot be sure that I have read all of it. How can I read the response and be sure there are no more bytes left in the response to an HTTP/1.1 request?


Note: As W3 says,

the Host request-header field MUST accompany all HTTP/1.1 requests

(and I did so for my HTTP/1.1 requests). But I haven't seen such a requirement for HTTP/1.0. Also, sending a request without a Host header using telnet works without any problem.


Update:

The PSH (push) flag is set to 1 in the TCP segment. I have also tried netsh winsock reset to reset my TCP/IP stack. There are no firewalls or antivirus programs on the testing computer. The packets are actually sent, because Wireshark installed on another computer can capture them.

I have also tried some other requests. For instance,

string req = "";
req += "GET / HTTP/1.0\r\n";
req += "s df slkjfd sdf/ s/fd \\sdf/\\\\dsfdsf \r\n";
req += "qwretyuiopasdfghjkl\r\n";
req += "Host: www.google.com\r\n";
req += "\r\n";

In all kinds of requests, if I omit the Host: part, the web server doesn't respond; with a Host: part, even an invalid request (like the one above) gets a response (400 Bad Request).

nos says the Host: part is not required on his machine, which makes the situation even weirder.

Isaac asked Sep 09 '09 16:09


2 Answers

This pertains to using TcpClient.

I know this post is old. I am providing this information just in case anyone else comes across this. Consider this answer a supplement to all of the above answers.

The HTTP Host header is required by some servers because they are set up to host more than one domain per IP address. As a general rule, always send the Host header. If it is missing, a good server will reply with an error such as "Not Found"; some servers won't reply at all.
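As a small illustration of "always send the Host header", here is a sketch of a helper that builds the request text. The class and method names (HttpRequestText, BuildGet) are made up for this example and do not come from the question's code.

```csharp
using System;
using System.Text;

static class HttpRequestText
{
    // Builds a minimal GET request. Every line ends in CRLF, and the header
    // section is terminated by an empty line (a bare CRLF).
    // Hypothetical helper for illustration only.
    public static string BuildGet(string host, string path = "/", string version = "1.0")
    {
        var sb = new StringBuilder();
        sb.Append("GET ").Append(path).Append(" HTTP/").Append(version).Append("\r\n");
        sb.Append("Host: ").Append(host).Append("\r\n"); // sent even for HTTP/1.0
        sb.Append("\r\n");                               // blank line ends the headers
        return sb.ToString();
    }
}
```

Calling HttpRequestText.BuildGet("www.google.com") yields the same three lines the question builds by hand, so the result can be passed straight to sw.Write.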

When the call to read data from the stream blocks, it's usually because the server is waiting for more data to be sent. This is typically the case when the HTTP/1.1 spec is not followed closely. To demonstrate this, try omitting the final CRLF sequence and then read data from the stream: the call to read will wait until either the client times out or the server gives up waiting and terminates the connection.
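To keep such a blocked read from hanging forever, one option is to set a receive timeout on the TcpClient. The sketch below uses hypothetical names (ReadTimeoutDemo, SendAndRead) and assumes that giving up on timeout is acceptable; a timed-out read surfaces as an IOException from the stream.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

static class ReadTimeoutDemo
{
    // Sends `request` to host:port and returns everything the server sends
    // until it closes the connection. Returns null if a read waits longer
    // than `timeoutMs` (any partial data is discarded in this sketch).
    public static string SendAndRead(string host, int port, string request, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            client.Connect(host, port);
            client.ReceiveTimeout = timeoutMs; // blocking reads now fail instead of hanging

            using (NetworkStream ns = client.GetStream())
            {
                byte[] bytes = Encoding.ASCII.GetBytes(request);
                ns.Write(bytes, 0, bytes.Length);

                try
                {
                    using (var sr = new StreamReader(ns, Encoding.ASCII))
                        return sr.ReadToEnd();
                }
                catch (IOException)
                {
                    // The server never answered, or never closed, in time.
                    return null;
                }
            }
        }
    }
}
```

Sending a request without the final "\r\n" through this helper should come back as null once the timeout elapses, unless the server gives up first and closes the connection.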

I hope this sheds a bit of light...

Sam Changtum answered Oct 14 '22 17:10


I found one question in all that:

How can I read the response and be sure I have read all of it in an HTTP/1.1 request?

And that is a question I can answer!

All the methods you're using here are synchronous, which are easy to use but make it hard to tell when you have the whole response. A single read can return only part of a sizable response, so you'll see problems as soon as the response no longer arrives in one piece.

To implement a TcpClient connection most robustly, you should use all asynchronous methods and callbacks. The relevant methods are as follows:

1) Create the connection with TcpClient.BeginConnect(...) with the callback calling TcpClient.EndConnect(...)
2) Send a request with TcpClient.GetStream().BeginWrite(...) with the callback calling TcpClient.GetStream().EndWrite(...)
3) Receive a response with TcpClient.GetStream().BeginRead(...) with the callback calling TcpClient.GetStream().EndRead(...), appending the result to a StringBuilder buffer, and then calling TcpClient.GetStream().BeginRead(...) again (with the same callback) until a response of 0 bytes is received.

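It's that final step (repeatedly calling BeginRead until 0 bytes are read) that solves the problem of fetching the response, the whole response, and nothing but the response. So help us TCP. Keep in mind that a 0-byte read only happens once the server closes the connection, so with HTTP/1.1 you need to send Connection: close, or else use the Content-Length header (or chunked encoding) to know where the response ends.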
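For comparison, here is a rough sketch of the same "keep reading until done" idea using a plain synchronous Read loop instead of BeginRead/EndRead. It stops after Content-Length bytes when that header is present, and otherwise reads until the server closes the connection (a 0-byte read). The names HttpResponseReader and ReadResponse are invented for this example. Chunked transfer encoding and bodiless responses (such as 204 or 304) are deliberately not handled.

```csharp
using System;
using System.IO;
using System.Text;

static class HttpResponseReader
{
    // Reads one HTTP response from `stream`; returns (header text, body bytes).
    // Sketch only: assumes the body is delimited by Content-Length or by the
    // server closing the connection. Transfer-Encoding: chunked is NOT handled.
    public static Tuple<string, byte[]> ReadResponse(Stream stream)
    {
        // 1. Read byte by byte (simple, not fast) up to the blank line
        //    "\r\n\r\n" that ends the header section.
        byte[] terminator = { 13, 10, 13, 10 };
        var headerBytes = new MemoryStream();
        int matched = 0; // how many terminator bytes have been seen in a row
        while (matched < terminator.Length)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException("Connection closed inside headers.");
            headerBytes.WriteByte((byte)b);
            matched = (b == terminator[matched]) ? matched + 1 : (b == 13 ? 1 : 0);
        }
        string headers = Encoding.ASCII.GetString(headerBytes.ToArray());

        // 2. Look for Content-Length (header names are case-insensitive).
        long contentLength = -1;
        foreach (string line in headers.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries))
        {
            int colon = line.IndexOf(':');
            if (colon > 0 && line.Substring(0, colon).Trim()
                    .Equals("Content-Length", StringComparison.OrdinalIgnoreCase))
            {
                contentLength = long.Parse(line.Substring(colon + 1).Trim());
            }
        }

        // 3. Read the body: exactly contentLength bytes if known, otherwise
        //    until Read returns 0 (the peer closed the connection).
        var body = new MemoryStream();
        byte[] buffer = new byte[8192];
        while (contentLength < 0 || body.Length < contentLength)
        {
            int want = contentLength < 0
                ? buffer.Length
                : (int)Math.Min(buffer.Length, contentLength - body.Length);
            int n = stream.Read(buffer, 0, want);
            if (n == 0) break; // closed early; the body may be truncated
            body.Write(buffer, 0, n);
        }
        return Tuple.Create(headers, body.ToArray());
    }
}
```

With the question's code, this could be called as HttpResponseReader.ReadResponse(ns) after the request has been written. Because it takes any Stream, it can also be tried against a MemoryStream holding a canned response.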

Hope that helps!

Task answered Oct 14 '22 15:10