I'm trying to parse HTML with the MSHTML parser in Delphi 10 Seattle. It works fine, but the ARTICLE tag confuses it: the parsed ARTICLE element has no innerHTML and no children, although they are present in the source.
program Project1;

{$APPTYPE CONSOLE}
{$R *.res}

uses
  System.SysUtils,
  Variants,
  ActiveX,
  MSHTML;

procedure DoParse;
var
  idoc: IHTMLDocument2;
  iCollection: IHTMLElementCollection;
  iElement: IHTMLElement;
  V: OleVariant;
  HTML: String;
  i: Integer;
begin
  // Test document containing the HTML5 <article> element
  Html :=
    '<html>'#10+
    '<head>'#10+
    ' <title>Articles</title>'#10+
    '</head>'#10+
    '<body>'#10+
    ' <article>'#10+
    ' <p>This is my Article</p>'#10+
    ' </article>'#10+
    '</body>'#10+
    '</html>';

  // IHTMLDocument2.write expects a SAFEARRAY of variants.
  // Note: bounds [0,1] allocate two elements, but only v[0] is filled; the
  // empty second element is probably what appears as "undefined" in the output.
  v := VarArrayCreate([0, 1], varVariant);
  v[0] := Html;

  idoc := CoHTMLDocument.Create as IHTMLDocument2;
  idoc.designMode := 'on';
  idoc.write(PSafeArray(System.TVarData(v).VArray));
  idoc.close;

  // Dump every element MSHTML created, with its tag name and outer HTML
  iCollection := idoc.all as IHTMLElementCollection;
  for i := 0 to iCollection.length - 1 do
  begin
    iElement := iCollection.item(i, 0) as IHTMLElement;
    if Assigned(iElement) then
      Writeln(iElement.tagName + ': ' + iElement.outerHTML);
  end;
end;

begin
  try
    DoParse;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.
The output of the program is:
HTML: <HTML><HEAD><TITLE>Articles</TITLE>
<META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
<BODY><ARTICLE>
<P>This is my Article</P></ARTICLE>undefined</BODY></HTML>
HEAD: <HEAD><TITLE>Articles</TITLE>
<META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
TITLE: <TITLE>Articles</TITLE>
META:
<META name=GENERATOR content="MSHTML 11.00.9600.18283">
BODY:
<BODY><ARTICLE>
<P>This is my Article</P></ARTICLE>undefined</BODY>
ARTICLE: <ARTICLE>
P:
<P>This is my Article</P>
/ARTICLE: </ARTICLE>
As you can see, the ARTICLE tag is parsed incorrectly: the ARTICLE element has no content, and /ARTICLE is reported as a separate tag.
Can someone help me understand this issue?
See the docs: custom element | custom object.
The Windows Internet Explorer support for custom tags on an HTML page requires that a namespace be defined for the tag. Otherwise, the custom tag is treated as an unknown tag when the document is parsed. Although navigating to a page with an unknown tag in Internet Explorer does not result in an error, unknown tags have the disadvantage of not being able to contain other tags, nor can they have behaviors applied to them.
In your case, ARTICLE is an unknown tag. To make it a custom tag that can contain other tags, you need to add a namespace to it, e.g. <MY:ARTICLE>, and declare the namespace with <html XMLNS:MY> (if you do not declare the namespace, the DOM parser will add it automatically).
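As a rough illustration of the namespace approach, the test document from the question could be rewritten as below. This is only a sketch: the constant name NamespacedHtml is hypothetical, and the MY prefix is simply the one used in the example above. The constant would be assigned to Html in DoParse in place of the original string.

```delphi
const
  // Hypothetical variant of the question's test document: the tag carries
  // the MY namespace prefix, and the prefix is declared on the <html> element.
  NamespacedHtml =
    '<html XMLNS:MY>'#10+
    '<head>'#10+
    ' <title>Articles</title>'#10+
    '</head>'#10+
    '<body>'#10+
    ' <MY:ARTICLE>'#10+
    ' <p>This is my Article</p>'#10+
    ' </MY:ARTICLE>'#10+
    '</body>'#10+
    '</html>';
```

This only helps when you control the markup; it does not apply to pages that already use the plain ARTICLE tag.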
See also: Using Custom Tags in Internet Explorer
In your comment you mentioned that you are trying to parse a live HTML5 page (you did not mention that in the question).
Since I'm not an HTML5 expert, I did not associate the ARTICLE tag with the HTML5 standard.
Your program runs in IE7 compatibility mode by default, so MSHTML does not know about this tag and treats it as an unknown tag.
So either add <!DOCTYPE html> as the first line of the HTML and <meta http-equiv="X-UA-Compatible" content="IE=edge"> as the first line of the HEAD section (it must be first), or try adding the FEATURE_BROWSER_EMULATION registry key: How to have Delphi TWebbrowser component running in IE9 mode?
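Both options can be sketched in Delphi as follows. This is a minimal sketch under stated assumptions, not a verified fix: the names EmulationSketch, StandardsModeHtml, EmulationKey and SetBrowserEmulation are hypothetical, the value 11000 is Microsoft's documented FEATURE_BROWSER_EMULATION value for IE11 mode, and whether a standalone CoHTMLDocument honors the registry key is an assumption (the linked answer targets TWebBrowser).

```delphi
program EmulationSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Win.Registry,
  Winapi.Windows;

const
  // Option 1: the question's test document with a DOCTYPE as the first line
  // and the X-UA-Compatible meta tag as the first child of HEAD.
  StandardsModeHtml =
    '<!DOCTYPE html>'#10+
    '<html>'#10+
    '<head>'#10+
    ' <meta http-equiv="X-UA-Compatible" content="IE=edge">'#10+
    ' <title>Articles</title>'#10+
    '</head>'#10+
    '<body>'#10+
    ' <article>'#10+
    ' <p>This is my Article</p>'#10+
    ' </article>'#10+
    '</body>'#10+
    '</html>';

  EmulationKey =
    'Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION';

// Option 2: register the current executable under FEATURE_BROWSER_EMULATION
// for the current user. Mode is the emulation value, e.g. 11000 for IE11.
procedure SetBrowserEmulation(Mode: Integer);
var
  Reg: TRegistry;
begin
  Reg := TRegistry.Create(KEY_WRITE);
  try
    Reg.RootKey := HKEY_CURRENT_USER;
    if Reg.OpenKey(EmulationKey, True) then
      try
        // The value name is the executable's file name, e.g. Project1.exe
        Reg.WriteInteger(ExtractFileName(ParamStr(0)), Mode);
      finally
        Reg.CloseKey;
      end;
  finally
    Reg.Free;
  end;
end;

begin
  SetBrowserEmulation(11000);
  // In the question's program, StandardsModeHtml would be assigned to Html
  // in DoParse instead of being printed.
  Writeln(StandardsModeHtml);
end.
```

The registry value is normally read when the browser engine is initialized, so it should be written before the first MSHTML object is created in the process.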
P.S.: idoc.designMode := 'on'; is not needed.