I'm using WebBrowser to get source of html pages . Our page source have some text and some html tags . like this :
FONT></P><P align=center><FONT color=#ccffcc size=3>**Hello There , This is a text in our html page** </FONT></P><P align=center> </P>
Html tags are random and we can not guess them . So is there any way to get texts only and separating them from html tags ?
you can use a TWebBrowser instance to parse and select the plaint text from html code.
see this sample
uses
MSHTML,
SHDocVw,
ActiveX;
function GetPlainText(Const Html: string): string;
var
DummyWebBrowser: TWebBrowser;
Document : IHtmlDocument2;
DummyVar : Variant;
begin
Result := '';
DummyWebBrowser := TWebBrowser.Create(nil);
try
//open an blank page to create a IHtmlDocument2 instance
DummyWebBrowser.Navigate('about:blank');
Document := DummyWebBrowser.Document as IHtmlDocument2;
if (Assigned(Document)) then //Check the Document
begin
DummyVar := VarArrayCreate([0, 0], varVariant); //Create a variant array to write the html code to the IHtmlDocument2
DummyVar[0] := Html; //assign the html code to the variant array
Document.Write(PSafeArray(TVarData(DummyVar).VArray)); //set the html in the document
Document.Close;
Result :=(Document.body as IHTMLBodyElement).createTextRange.text;//get the plain text
end;
finally
DummyWebBrowser.Free;
end;
end;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With