
How to get the "text" of an HTML page? (WebBrowser - Delphi)

I'm using TWebBrowser to get the source of HTML pages. The page source contains some text mixed with HTML tags, like this:

FONT></P><P align=center><FONT color=#ccffcc size=3>**Hello There , This is a text in our html page** </FONT></P><P align=center> </P>

The HTML tags vary from page to page, so we cannot predict them. Is there any way to get only the text, separated from the HTML tags?

Kermia asked Dec 16 '22 21:12



1 Answer

You can use a TWebBrowser instance to parse the HTML code and extract the plain text from it.

See this sample:

uses
  Variants, // VarArrayCreate
  Forms,    // Application.ProcessMessages
  ActiveX,  // PSafeArray
  MSHTML,
  SHDocVw;

function GetPlainText(const Html: string): string;
var
  DummyWebBrowser: TWebBrowser;
  Document: IHTMLDocument2;
  DummyVar: Variant;
begin
  Result := '';
  DummyWebBrowser := TWebBrowser.Create(nil);
  try
    // Open a blank page so the browser creates an IHTMLDocument2 instance
    DummyWebBrowser.Navigate('about:blank');
    // Navigate is asynchronous: wait until the blank page has finished loading,
    // otherwise Document can still be nil at this point
    while DummyWebBrowser.ReadyState <> READYSTATE_COMPLETE do
      Application.ProcessMessages;
    Document := DummyWebBrowser.Document as IHTMLDocument2;
    if Assigned(Document) then // check the document
    begin
      // Document.write expects a SAFEARRAY of Variants,
      // so wrap the HTML code in a one-element variant array
      DummyVar := VarArrayCreate([0, 0], varVariant);
      DummyVar[0] := Html;
      Document.write(PSafeArray(TVarData(DummyVar).VArray)); // set the HTML in the document
      Document.close;
      // The text range of the body holds the page text without the tags
      if Assigned(Document.body) then
        Result := (Document.body as IHTMLBodyElement).createTextRange.text;
    end;
  finally
    DummyWebBrowser.Free;
  end;
end;
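
As a rough usage sketch, the function above could be called from a VCL form like this. The names TForm1, Button1 and Memo1 are hypothetical and not part of the original answer, and the sample string is adapted from the fragment in the question, so treat it as an illustration rather than a tested program:

```delphi
// Hypothetical event handler: assumes a VCL form (TForm1) with a TButton (Button1)
// and a TMemo (Memo1), and that GetPlainText is declared in the same unit.
procedure TForm1.Button1Click(Sender: TObject);
const
  // Sample markup adapted from the question
  SampleHtml =
    '<P align=center><FONT color=#ccffcc size=3>' +
    'Hello There , This is a text in our html page' +
    '</FONT></P>';
begin
  // Show only the text of the page, with the tags removed
  Memo1.Lines.Text := GetPlainText(SampleHtml);
end;
```

Because the parsing is done by the browser engine itself, this approach should cope with arbitrary or malformed tags, which is the case the question describes.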
RRUZ answered Dec 29 '22 16:12
