
Invalid parsing ARTICLE tag by MSHTML

I'm trying to parse HTML with the MSHTML parser in Delphi 10 Seattle. It works fine, except that the ARTICLE tag confuses it: the parsed ARTICLE element has no innerHTML and no children, although both are present in the source.

program Project1;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils,
  Variants,
  ActiveX,
  MSHTML;

procedure DoParse;
var
  idoc: IHTMLDocument2;
  iCollection: IHTMLElementCollection;
  iElement: IHTMLElement;
  V: OleVariant;
  HTML: String;
  i: Integer;
begin
  Html :=
    '<html>'#10+
    '<head>'#10+
    '    <title>Articles</title>'#10+
    '</head>'#10+
    '<body>'#10+
    '    <article>'#10+
    '        <p>This is my Article</p>'#10+
    '    </article>'#10+
    '</body>'#10+
    '</html>';


  v := VarArrayCreate( [0,1], varVariant);
  v[0]:= Html;

  idoc := CoHTMLDocument.Create as IHTMLDocument2;
  idoc.designMode := 'on';
  idoc.write(PSafeArray(System.TVarData(v).VArray));
  idoc.close;

  iCollection := idoc.all as IHTMLElementCollection;
  for i := 0 to iCollection.length-1 do
  begin
    iElement := iCollection.item( i, 0) as IHTMLElement;
    if assigned(ielement) then
      WriteLN(iElement.tagName + ': ' + iElement.outerHTML);
  end;
end;

begin
  try
    DoParse;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  ReadLN;
end.

The output of the program is:

HTML: <HTML><HEAD><TITLE>Articles</TITLE>
<META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
<BODY><ARTICLE>
<P>This is my Article</P></ARTICLE>undefined</BODY></HTML>
HEAD: <HEAD><TITLE>Articles</TITLE>
<META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
TITLE: <TITLE>Articles</TITLE>
META:
<META name=GENERATOR content="MSHTML 11.00.9600.18283">
BODY:
<BODY><ARTICLE>
<P>This is my Article</P></ARTICLE>undefined</BODY>
ARTICLE: <ARTICLE>
P:
<P>This is my Article</P>
/ARTICLE: </ARTICLE>

As you can see, the ARTICLE tag is parsed incorrectly: the ARTICLE element has no content, and /ARTICLE is reported as a separate tag.

Can someone help me understand this issue?

Konstantin Knyazev asked May 18 '16 08:05


1 Answer

See the docs: custom element | custom object.

The Windows Internet Explorer support for custom tags on an HTML page requires that a namespace be defined for the tag. Otherwise, the custom tag is treated as an unknown tag when the document is parsed. Although navigating to a page with an unknown tag in Internet Explorer does not result in an error, unknown tags have the disadvantage of not being able to contain other tags, nor can they have behaviors applied to them.

In your case ARTICLE is an unknown tag. To make it a custom tag that can contain other tags, you need to give it a namespace, e.g. <MY:ARTICLE>, and declare the namespace with <html XMLNS:MY> (if you do not declare the namespace, the DOM parser will add it automatically).
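As a rough sketch of the namespaced variant described above, the test HTML from the question could be rewritten as follows. The `my` prefix and the program name are only illustrative choices, and whether MSHTML then reports the paragraph as a child of the element should be checked against your installed version.

```pascal
program NamespacedTagSketch;

{$APPTYPE CONSOLE}

const
  // Hypothetical rewrite of the question's test HTML: the namespace is
  // declared on <html> and the tag carries the illustrative "my" prefix.
  NamespacedHtml =
    '<html xmlns:my>'#10 +
    '<head>'#10 +
    '    <title>Articles</title>'#10 +
    '</head>'#10 +
    '<body>'#10 +
    '    <my:article>'#10 +
    '        <p>This is my Article</p>'#10 +
    '    </my:article>'#10 +
    '</body>'#10 +
    '</html>';

begin
  // Print the markup; in the question's program this string would be
  // assigned to Html in DoParse in place of the original markup.
  Writeln(NamespacedHtml);
end.
```

This only helps when you control the markup; it does not apply to third-party HTML5 pages that use a plain <article> tag.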

See also: Using Custom Tags in Internet Explorer


In your comment you mentioned that you are trying to parse a live HTML5 page (you did not mention that in the question).
Since I'm not an HTML5 expert, I did not associate the ARTICLE tag with the HTML5 standard.

Your program runs in IE7 compatibility mode by default, so MSHTML does not know about this tag and treats it as an unknown tag.

So either add <!DOCTYPE html> as the first line of the HTML and <meta http-equiv="X-UA-Compatible" content="IE=edge"> as the first line of the HEAD section (it must be first), or try adding the FEATURE_BROWSER_EMULATION registry key; see: How to have Delphi TWebbrowser component running in IE9 mode?
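Both options can be sketched in Delphi roughly as follows. This is a minimal sketch, not a verified fix: the names `Html5Markup`, `EmulationKey` and `SetBrowserEmulation` are made up for illustration, the value 11001 is the documented FEATURE_BROWSER_EMULATION setting for IE11 edge mode, and whether the registry key influences a standalone CoHTMLDocument (as opposed to a hosted TWebBrowser) is an assumption you would need to test.

```pascal
program EdgeModeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Win.Registry,
  Winapi.Windows;

const
  // Option 1: the question's test HTML with the doctype added as the first
  // line and the X-UA-Compatible meta as the first line of HEAD.
  Html5Markup =
    '<!DOCTYPE html>'#10 +
    '<html>'#10 +
    '<head>'#10 +
    '    <meta http-equiv="X-UA-Compatible" content="IE=edge">'#10 +
    '    <title>Articles</title>'#10 +
    '</head>'#10 +
    '<body>'#10 +
    '    <article>'#10 +
    '        <p>This is my Article</p>'#10 +
    '    </article>'#10 +
    '</body>'#10 +
    '</html>';

  EmulationKey =
    'Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION';

// Option 2: register the running executable under FEATURE_BROWSER_EMULATION
// for the current user. Mode 11001 requests IE11 edge mode.
procedure SetBrowserEmulation(Mode: Integer);
var
  Reg: TRegistry;
begin
  Reg := TRegistry.Create;
  try
    Reg.RootKey := HKEY_CURRENT_USER;
    // Second argument True: create the key if it does not exist yet.
    if Reg.OpenKey(EmulationKey, True) then
    try
      // The value name is the executable's file name, e.g. Project1.exe.
      Reg.WriteInteger(ExtractFileName(ParamStr(0)), Mode);
    finally
      Reg.CloseKey;
    end;
  finally
    Reg.Free;
  end;
end;

begin
  SetBrowserEmulation(11001);
  // In the question's program this string would be assigned to Html in DoParse.
  Writeln(Html5Markup);
end.
```

The registry value has to be in place before the MSHTML document is created, so a call like this would go ahead of DoParse.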

P.S.: idoc.designMode := 'on'; is not needed.

kobik answered Nov 17 '22 17:11