
How to get webpage title without downloading all the page source

I'm looking for a method that will allow me to get the title of a webpage and store it as a string.

However, all the solutions I have found so far involve downloading the full page source, which isn't really practical for a large number of webpages.

The only workaround I can see is to limit how much is downloaded, either reading a set number of characters or stopping once the closing title tag is reached, however this will presumably still be quite large?

Thanks

asked Jul 25 '12 by quotidian


2 Answers

As the <title> tag is part of the HTML itself, there is no way to get "just the title" without downloading the file. You can, however, download only a portion of the file, stopping once you've read the <title> tag or the </head> tag, but you'll still need to download (at least a portion of) the file.

This can be accomplished with HttpWebRequest/HttpWebResponse and reading in data from the response stream until we've either read in a <title></title> block, or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block - so, with this check we will never parse the entire file in any case (unless there is no head block, of course).

The following should be able to accomplish this task:

// namespaces required by this snippet
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

string title = "";
try {
    HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
    HttpWebResponse response = (request.GetResponse() as HttpWebResponse);

    using (Stream stream = response.GetResponseStream()) {
        // compiled regex to check for <title></title> block
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        int bytesToRead = 8092;
        byte[] buffer = new byte[bytesToRead];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
            // convert the byte-array to a string and add it to the rest of the
            // contents that have been downloaded so far
            contents += Encoding.UTF8.GetString(buffer, 0, length);

            Match m = titleCheck.Match(contents);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value.ToString();
                break;
            } else if (contents.Contains("</head>")) {
                // reached end of head-block; no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}

UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.
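On frameworks that have HttpClient available, the same stop-early idea can be sketched as a variant like the one below (the GetTitleAsync name and the 8192-byte buffer are illustrative assumptions, not part of the original answer); HttpCompletionOption.ResponseHeadersRead makes the response body stream in rather than being buffered in full:

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static async Task<string> GetTitleAsync(string url)
{
    using (HttpClient client = new HttpClient())
    // ResponseHeadersRead: return as soon as the headers arrive; the body is streamed
    using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
    using (Stream stream = await response.Content.ReadAsStreamAsync())
    {
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.IgnoreCase);
        byte[] buffer = new byte[8192];
        StringBuilder contents = new StringBuilder();
        int length;
        while ((length = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            // append this chunk and check whether a complete <title> block has arrived yet
            contents.Append(Encoding.UTF8.GetString(buffer, 0, length));
            Match m = titleCheck.Match(contents.ToString());
            if (m.Success) return m.Groups[1].Value;
            if (contents.ToString().Contains("</head>")) break; // end of head block, no title found
        }
        return string.Empty;
    }
}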

answered Sep 19 '22 by newfurniturey


A simpler way to handle this would be to download the whole page, then split out the title from the source:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // note: in real code you'd normally reuse a single HttpClient instance
    private async Task getSite(string url)
    {
        HttpClient hc = new HttpClient();
        HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute));
        string source = await response.Content.ReadAsStringAsync();

        //process the source here

    }

To process the source, you can use the approach described in the article on Getting Content From Between HTML Tags.
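As a rough sketch of that tag-extraction idea (assuming the page actually contains a <title> element; this is a simple regex rather than a full HTML parser, and the GetTitle helper name is just illustrative):

    using System.Text.RegularExpressions;

    private static string GetTitle(string source)
    {
        // grab whatever sits between <title ...> and </title>, ignoring case
        Match m = Regex.Match(source, @"<title[^>]*>\s*(.+?)\s*</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }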

answered Sep 20 '22 by user151243