I am developing a C# app that gets web pages and processes their contents line by line. To do this, I use the HttpClient class, and read the page contents through ReadAsStreamAsync(). Then I read the stream into a line array and iterate over it. So far so good.
However, the HTML that I obtain with this method is not identical to the HTML that I observe if I navigate to the web page using Chrome or Edge and use View Source to get to the HTML. In particular, the __VIEWSTATE and __VIEWSTATEGENERATOR hidden input elements are surrounded by div elements with class="aspNetHidden" when I use the browser, but not when I get the HTML programmatically. This ruins my line tracking logic, as the page seen by the browser has extra lines compared to the page I am getting in code.
EDIT. After some testing, I am confident that the user agent header employed by the client is what determines whether or not the class="aspNetHidden" div is served. When I mimic my browser's user agent ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"), the div is served; if I use some other agent such as "Test Client", the div is not served.
My question then is, is there any documentation on which user agent strings cause the div to be served and which don't? Also, can I prevent this from happening?
Thanks.
In short, this is not documented or specified in terms of user agents, but in terms of browser capabilities.
Based on the browser's user agent, a set of capabilities gets set up.
These capabilities are configured in .browser configuration files on the web server.
For .NET 4, for example, you find these files in %SystemRoot%\Microsoft.NET\Framework\v4.0.30319\config\browsers, e.g. chrome.browser, iphone.browser, etc.
Such a .browser file contains a tagwriter capability.
E.g. chrome.browser:
<browsers>
    <!-- Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/530.1 (KHTML, like Gecko) Chrome/2.0.168.0 Safari/530.1 -->
    <browser id="Chrome" parentID="WebKit">
        <identification>
            <userAgent match="Chrome/(?'version'(?'major'\d+)(\.(?'minor'\d+)?)\w*)" />
        </identification>
        <capabilities>
            <capability name="browser" value="Chrome" />
            <capability name="tagwriter" value="System.Web.UI.HtmlTextWriter" />
            <!-- ... -->
        </capabilities>
    </browser>
</browsers>
The tagwriter capability specifies whether a System.Web.UI.HtmlTextWriter or a System.Web.UI.Html32TextWriter will be instantiated to write the output.
The default configuration in the Default.browser file declares tagwriter as:
<capability name="tagwriter" value="System.Web.UI.Html32TextWriter" />
Also, if the tagwriter capability is missing, an Html32TextWriter is used.
From the Microsoft reference source:
internal HtmlTextWriter CreateHtmlTextWriterInternal(TextWriter tw) {
    Type tagWriter = TagWriter;
    if (tagWriter != null) {
        return Page.CreateHtmlTextWriterFromType(tw, tagWriter);
    }
    // Fall back to Html 3.2
    return new Html32TextWriter(tw);
}
The Html32TextWriter declares that no div is rendered around hidden input fields.
From the Microsoft reference source:
internal override bool RenderDivAroundHiddenInputs {
    get {
        return false;
    }
}
The HtmlTextWriter does return true for RenderDivAroundHiddenInputs; see the Microsoft reference source.
Some more reading about all this here.
What you can do.
If you always want the wrapping div, use one of the well-known user agents; otherwise use a custom one like the Test Client you are already using.
If you control the website being requested, you can set up a custom .browser file for your custom user agent, but I would rather not go that way.
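For completeness, a custom definition might look roughly like the fragment below. This is only a sketch under assumptions: the file name (e.g. App_Browsers\TestClient.browser in the web application), the id TestClient, and the match expression are made up for illustration, and it assumes the client identifies itself as "Test Client". It reuses the tagwriter capability shown above to switch that client to the HtmlTextWriter.

```xml
<browsers>
    <!-- Hypothetical definition for a client sending "Test Client" as its user agent -->
    <browser id="TestClient" parentID="Default">
        <identification>
            <userAgent match="Test Client" />
        </identification>
        <capabilities>
            <!-- Same writer as chrome.browser, so the div around hidden inputs is rendered -->
            <capability name="tagwriter" value="System.Web.UI.HtmlTextWriter" />
        </capabilities>
    </browser>
</browsers>
```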
When making the request, just set the appropriate User-Agent request header on your HttpClient, e.g.:
var client = new HttpClient();
var userAgent = "Test Client"; // Or "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"
client.DefaultRequestHeaders.Add("User-Agent", userAgent);
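To tie this back to the line-by-line processing from the question, here is a minimal sketch of how the fixed user agent and the stream reading could fit together. The class name PageFetcher and its method names are made up for illustration; only the HttpClient and StreamReader calls are standard framework APIs.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Creates a client that sends the same User-Agent on every request,
    // so the server always picks the same tag writer for it.
    public static HttpClient CreateClient(string userAgent)
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.UserAgent.ParseAdd(userAgent);
        return client;
    }

    // Downloads the page and returns its lines in order.
    public static async Task<List<string>> ReadLinesAsync(HttpClient client, string url)
    {
        var lines = new List<string>();
        using (var response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            using (var stream = await response.Content.ReadAsStreamAsync())
            using (var reader = new StreamReader(stream))
            {
                string line;
                while ((line = await reader.ReadLineAsync()) != null)
                {
                    lines.Add(line);
                }
            }
        }
        return lines;
    }
}
```

Whichever user agent you pick, keeping it constant is what keeps the line numbers stable between runs.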
This can happen for a number of reasons. One of the most likely is the one that @thangadurai mentioned: there may be a script which gets executed on load of the HTML and changes the HTML content. This could be avoided by using a UI testing framework such as Selenium or by using headless Chrome programmatically.
Another possible reason is a User-Agent dependent implementation. This can be solved simply by changing the User-Agent header.
EDIT: If you control the webpage, you could probably disable ViewState, if that's the case. The behavior might be based on detecting the User-Agent capabilities. For your processing, you could go with either string and keep it static when you send the request, though it might not be as reliable. Another way to do the processing without line-by-line parsing could be to use a regular expression to match specific tags. The specifics of how the server decides on rendering the ViewState div were nicely described by @pfx here.
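As a rough illustration of the regular expression idea, the sketch below pulls the hidden input fields out of the page regardless of whether they are wrapped in the class="aspNetHidden" div. The class name HiddenInputs is made up, and the pattern assumes the attribute order and double quotes that ASP.NET typically emits (type, name, id, value); other markup would need a different pattern or a real HTML parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HiddenInputs
{
    // Assumption: <input type="hidden" name="..." id="..." value="..." />
    private static readonly Regex HiddenInput = new Regex(
        @"<input\s+type=""hidden""\s+name=""(?<name>[^""]+)""[^>]*?\svalue=""(?<value>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns a name -> value map of all hidden inputs found in the HTML.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in HiddenInput.Matches(html))
        {
            result[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

Because the match does not depend on line positions, the result should be the same for both variants of the page.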