I'm looking for a method that will allow me to get the title of a webpage and store it as a string.
However, all the solutions I have found so far involve downloading the full source of the page, which isn't really practical for a large number of webpages.
The only approach I can see is to limit how much gets downloaded, either to a fixed number of characters or by stopping once the title tag is reached, but won't that still be quite a lot of data?
Thanks
As the <title> tag is in the HTML itself, there is no way to find "just the title" without downloading the file. You should be able to download a portion of the file until you've read in the <title> tag or the </head> tag and then stop, but you'll still need to download (at least a portion of) the file.
This can be accomplished with HttpWebRequest/HttpWebResponse and reading in data from the response stream until we've either read in a <title></title> block or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block, so with this check we will never parse the entire file in any case (unless there is no head block, of course).
The following should be able to accomplish this task:
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// url is assumed to hold the address of the page to inspect
string title = "";
try {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // dispose both the response and its stream once we stop reading
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream()) {
        // compiled regex to check for a <title></title> block; Singleline lets the
        // title span several lines and [^>]* tolerates attributes on the tag
        Regex titleCheck = new Regex(@"<title[^>]*>\s*(.+?)\s*</title>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
        int bytesToRead = 8092;
        byte[] buffer = new byte[bytesToRead];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
            // convert the byte-array to a string and add it to the rest of the
            // contents that have been downloaded so far (assumes UTF-8; a multi-byte
            // character split across two reads may be decoded incorrectly)
            contents += Encoding.UTF8.GetString(buffer, 0, length);
            Match m = titleCheck.Match(contents);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value;
                break;
            } else if (contents.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                // reached end of head-block; no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}
UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.
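As a hedged variation on the same idea, the sketch below moves the scanning loop into a helper that accepts any Stream, so it can be tried without a network connection. The names TitleScanner and ReadTitle are made up for this illustration, and the code assumes the page is UTF-8. A StreamReader does the decoding, which avoids garbling a multi-byte character that happens to straddle two reads.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class TitleScanner
{
    private static readonly Regex TitleCheck = new Regex(
        @"<title[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads the stream in chunks and stops at the first <title> block or at
    // </head>, whichever shows up first. Returns "" if no title is found.
    public static string ReadTitle(Stream stream)
    {
        // Assumption: the document is UTF-8 encoded.
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            char[] buffer = new char[4096];
            StringBuilder contents = new StringBuilder();
            int length;
            while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                contents.Append(buffer, 0, length);
                string soFar = contents.ToString();

                Match m = TitleCheck.Match(soFar);
                if (m.Success)
                {
                    return m.Groups[1].Value;
                }
                if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    break; // end of the head block reached without a title
                }
            }
        }
        return "";
    }
}
```

To feed it a live response with HttpClient, one option is to call GetAsync with HttpCompletionOption.ResponseHeadersRead, so the body is not buffered up front, and pass the stream from response.Content.ReadAsStreamAsync() to ReadTitle. Disposing the reader also closes the stream, which ends the download early.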
A simpler way to handle this would be to download it, then split:
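using System;
using System.Net.Http;
using System.Threading.Tasks;

// place this method inside a class; it returns a Task instead of async void
// so that callers can await it and observe exceptions
private async Task<string> getSite(string url)
{
    using (HttpClient hc = new HttpClient())
    using (HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute)))
    {
        string source = await response.Content.ReadAsStringAsync();
        // process the source here
        return source;
    }
}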
To process the source, you can use the method described in the article on Getting Content From Between HTML Tags.
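Since that article is not reproduced here, the following is only a minimal sketch of what pulling the title out of the downloaded source could look like. It uses a regular expression, which may differ from the method the article describes. The names TitleExtractor and ExtractTitle are hypothetical. A regex is adequate for a simple title lookup but is not a general HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TitleExtractor
{
    // Returns the text between <title> and </title>, or "" if there is none.
    public static string ExtractTitle(string source)
    {
        Match m = Regex.Match(
            source,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success)
        {
            return "";
        }
        // Decode entities such as &amp; and trim surrounding whitespace.
        return WebUtility.HtmlDecode(m.Groups[1].Value).Trim();
    }
}
```

Calling TitleExtractor.ExtractTitle(source) on the string returned by the download step would then give the title as a plain string.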