
How to get XML (RAW/SOURCE) from a WebBrowser Control

I am using the WebBrowser control in both my Delphi and .NET C# test projects to navigate to a local test XML file, and I try to save the content back to an XML file in the DocumentCompleted event (.NET) and in the OnNavigateComplete2 event (Delphi).

The problem is that I always get the HTML that the browser generates for viewing the XML (see the output below, which I saved using the following code):

// Saves whatever the browser currently holds as its document to FileName.
procedure TForm1.SaveHTMLSourceToFile(const FileName: string;
  WB: TWebBrowser);
var
  PersistStream: IPersistStreamInit;
  FileStream: TFileStream;
  Stream: IStream;
  SaveResult: HRESULT;
begin
  // The loaded document exposes IPersistStreamInit for serialisation.
  PersistStream := WB.Document as IPersistStreamInit;
  FileStream := TFileStream.Create(FileName, fmCreate);
  try
    // soReference: the adapter does not take ownership of FileStream.
    Stream := TStreamAdapter.Create(FileStream, soReference) as IStream;
    SaveResult := PersistStream.Save(Stream, True);
    if FAILED(SaveResult) then
      MessageBox(Handle, 'Failed to save source', 'Error', 0);
  finally
    FileStream.Free;
  end;
end;

Well, I've tried almost everything and searched everywhere, but so far I couldn't find anything useful. With the following Delphi code I managed to show the source, which works (so the source must be stored somewhere), but I cannot use this approach because it shows a dialog, and it is not easy to get the data out and close that dialog (in my test case Notepad.exe opens with my XML content).

  AWebBrowser.Document.QueryInterface(IOleCommandTarget, CmdTarget) ;
  if CmdTarget <> nil then
  try
    CmdTarget.Exec(PtrGUID, HTMLID_VIEWSOURCE, 0, vaIn, vaOut) ;
  finally
    CmdTarget._Release;
  end;

I also managed to invoke Save As with the xxx-HIDE-xxx flag, but it seems that from IE 5 onward the Save As dialog is shown anyway (the hide flag is ignored).

I also tried to get the XML data from the cache (Cache API), but 1. in my case I don't get anything, and 2. what if caching is disabled on the customer's machine? ;-)

InnerText, InnerHTML, etc. cannot be used, since they contain the - and + characters and do not represent the original raw data (the source).

Just for your information: there is no way for me to use WebClient or Indy components to access the XML. I also can't act as a proxy, since opening ports (let's say 8080) on the customer's machine is painful because it requires privileged user access.

So here I am, asking whether you have any idea how to solve my problem.

Thanks in advance, Cheers

input:

<?xml version="1.0" encoding="UTF-8"?>
<test><data>xxxx</data></test>

output:

<HTML><HEAD>
<STYLE>BODY{font:x-small 'Verdana';margin-right:1.5em}
.c{cursor:hand}
.b{color:red;font-family:'Courier New';font-weight:bold;text-decoration:none}
.e{margin-left:1em;text-indent:-1em;margin-right:1em}
.k{margin-left:1em;text-indent:-1em;margin-right:1em}
.t{color:#990000}
.xt{color:#990099}
.ns{color:red}
.dt{color:green}
.m{color:blue}
.tx{font-weight:bold}
.db{text-indent:0px;margin-left:1em;margin-top:0px;margin-bottom:0px;padding-left:.3em;border-left:1px solid #CCCCCC;font:small Courier}
.di{font:small Courier}
.d{color:blue}
.pi{color:blue}
.cb{text-indent:0px;margin-left:1em;margin-top:0px;margin-bottom:0px;padding-left:.3em;font:small Courier;color:#888888}
.ci{font:small Courier;color:#888888}
PRE{margin:0px;display:inline}</STYLE>
<SCRIPT><!--
function f(e){
if (e.className=="ci"){if (e.children(0).innerText.indexOf("\n")>0) fix(e,"cb");}
if (e.className=="di"){if (e.children(0).innerText.indexOf("\n")>0) fix(e,"db");}
e.id="";
}
function fix(e,cl){
e.className=cl;
e.style.display="block";
j=e.parentElement.children(0);
j.className="c";
k=j.children(0);
k.style.visibility="visible";
k.href="#";
}
function ch(e){
mark=e.children(0).children(0);
if (mark.innerText=="+"){
mark.innerText="-";
for (var i=1;i<e.children.length;i++)
e.children(i).style.display="block";
}
else if (mark.innerText=="-"){
mark.innerText="+";
for (var i=1;i<e.children.length;i++)
e.children(i).style.display="none";
}}
function ch2(e){
mark=e.children(0).children(0);
contents=e.children(1);
if (mark.innerText=="+"){
mark.innerText="-";
if (contents.className=="db"||contents.className=="cb")
contents.style.display="block";
else contents.style.display="inline";
}
else if (mark.innerText=="-"){
mark.innerText="+";
contents.style.display="none";
}}
function cl(){
e=window.event.srcElement;
if (e.className!="c"){e=e.parentElement;if (e.className!="c"){return;}}
e=e.parentElement;
if (e.className=="e") ch(e);
if (e.className=="k") ch2(e);
}
function ex(){}
function h(){window.status=" ";}
document.onclick=cl;
--></SCRIPT>
</HEAD>
<BODY class="st"><DIV class="e">
<SPAN class="b">&nbsp;</SPAN>
<SPAN class="m">&lt;?</SPAN><SPAN class="pi">xml version="1.0" encoding="UTF-8" </SPAN><SPAN class="m">?&gt;</SPAN>
</DIV>
<DIV class="e">
<DIV class="c" STYLE="margin-left:1em;text-indent:-2em"><A href="#" onclick="return false" onfocus="h()" class="b">-</A>
<SPAN class="m">&lt;</SPAN><SPAN class="t">test</SPAN><SPAN class="m">&gt;</SPAN></DIV>
<DIV><DIV class="e"><DIV STYLE="margin-left:1em;text-indent:-2em">
<SPAN class="b">&nbsp;</SPAN>
<SPAN class="m">&lt;</SPAN><SPAN class="t">data</SPAN><SPAN class="m">&gt;</SPAN><SPAN class="tx">xxxx</SPAN><SPAN class="m">&lt;/</SPAN><SPAN class="t">data</SPAN><SPAN class="m">&gt;</SPAN>
</DIV></DIV>
<DIV><SPAN class="b">&nbsp;</SPAN>
<SPAN class="m">&lt;/</SPAN><SPAN class="t">test</SPAN><SPAN class="m">&gt;</SPAN></DIV>
</DIV></DIV>
</BODY>
</HTML>

Gohlool asked May 26 '11 12:05


2 Answers

I think you're approaching this the wrong way. A TWebBrowser control is a visual control intended for viewing. You may be able to extract the underlying data from it, but fundamentally, using a visual control to download something (a non-visual action) is not a good approach. Instead, you should download the file using a dedicated API.

"Just for your information: There is no way for me to use WebClient or Indy components to access the xml. I also can't play as a Proxy since..."

Don't you have those components? In that case, I'd suggest you use either of the following approaches:

  1. TDownloadURL is an inbuilt class, useful for simple downloading of a file. Some examples of using it:

    • An HTML page scraper - obviously also applicable to XML
    • How to show a progress indicator while downloading - may not be useful if your file is small
  2. InternetReadFile. This is what I personally use in my own code - I have a small thread class to asynchronously download files and notify the main thread when they're done, implemented using this function. Use it by:

    • Call InternetOpen to initialise use of the internet functions; it returns a handle.
    • Pass that handle to InternetOpenUrl, with the INTERNET_FLAG_HYPERLINK or INTERNET_FLAG_NO_UI flags, to get a second handle.
    • Call InternetReadFile with that second handle in a loop, writing to a buffer, until the file has been read or your thread is terminated.
    • Don't forget to close both handles using InternetCloseHandle.

    Sorry I can't post source code, but they're simple functions and you should find it easy enough to write.
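
A minimal sketch of those four steps in Delphi, for illustration only (this is not the answerer's code). It assumes the WinInet unit that ships with Delphi; the function name DownloadRaw, the agent string 'RawXmlFetcher' and the 4 KB buffer size are made up for the example.

```delphi
program FetchRaw;

{$APPTYPE CONSOLE}

uses
  Windows, WinInet, Classes, SysUtils;

// Hypothetical helper: streams the bytes at AUrl into ADest without any
// browser-side transformation. Returns False if any WinInet call fails.
function DownloadRaw(const AUrl: string; ADest: TStream): Boolean;
var
  Session, Request: HINTERNET;
  Buffer: array[0..4095] of Byte;
  BytesRead: DWORD;
begin
  Result := False;
  // Step 1: initialise WinInet; the agent string is arbitrary.
  Session := InternetOpen('RawXmlFetcher', INTERNET_OPEN_TYPE_PRECONFIG,
    nil, nil, 0);
  if Session = nil then
    Exit;
  try
    // Step 2: open the URL without letting WinInet show any UI.
    Request := InternetOpenUrl(Session, PChar(AUrl), nil, 0,
      INTERNET_FLAG_NO_UI, 0);
    if Request = nil then
      Exit;
    try
      // Step 3: read until InternetReadFile reports zero bytes (end of data).
      repeat
        if not InternetReadFile(Request, @Buffer, SizeOf(Buffer), BytesRead) then
          Exit;
        if BytesRead > 0 then
          ADest.WriteBuffer(Buffer, BytesRead);
      until BytesRead = 0;
      Result := True;
    finally
      // Step 4: close the handles, innermost first.
      InternetCloseHandle(Request);
    end;
  finally
    InternetCloseHandle(Session);
  end;
end;

var
  Output: TFileStream;
begin
  // Usage: FetchRaw <url> <output file>
  Output := TFileStream.Create(ParamStr(2), fmCreate);
  try
    if not DownloadRaw(ParamStr(1), Output) then
      Writeln('Download failed');
  finally
    Output.Free;
  end;
end.
```

In a GUI application the same function would typically run in a worker thread, as described above, so the read loop does not block the main thread.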

These approaches will get you either a file or a buffer, each containing the raw contents of your XML file.

Edit: I see you explained a bit about why you can't use Indy:

"The real scenario is much complex and need user interaction in the browser and after the user did everything there are some post posts between browser and user till the end result is a XML file which you have no control on where is comes from!"

I'm not certain this stops you using Indy: instead, you just need to get the location of this XML. The fact that you don't control where it is doesn't matter; you just need to find out where it is. Either scrape the HTML if all you have is a link (you can already get HTML from the browser, which is in fact your problem!) or look at the final location of the TWebBrowser document, and download that. In other words, let the user do whatever they have to do to navigate to the final XML file, but rather than trying to extract it from the web browser control, download it yourself.
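
As a rough sketch of the "download the final location yourself" idea, the fragment below reads the browser's current address from its LocationURL property and fetches it with URLDownloadToFile from the UrlMon unit. The helper name SaveCurrentLocation is hypothetical, and the fragment is meant to be pasted into a form unit rather than compiled on its own. It also assumes that a plain second request returns the same document, which may not hold if the XML was produced in response to a POST.

```delphi
uses
  Windows, SHDocVw, UrlMon;

// Hypothetical helper: re-downloads the document the browser is currently
// showing and writes the untouched bytes to FileName.
function SaveCurrentLocation(WB: TWebBrowser; const FileName: string): Boolean;
var
  Url: string;
begin
  // LocationURL holds the address of the document currently loaded.
  Url := WB.LocationURL;
  // URLDownloadToFile returns an HRESULT; Succeeded turns it into a Boolean.
  Result := (Url <> '') and
    Succeeded(URLDownloadToFile(nil, PChar(Url), PChar(FileName), 0, nil));
end;
```

It could be called once navigation has finished, for example SaveCurrentLocation(WebBrowser1, 'C:\temp\raw.xml'), where both the component name and the path are placeholders.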

David answered Sep 24 '22 18:09


You could do a "shadow" download of the file in the TWebBrowser BeforeNavigate2 event.
By shadow, I mean use a procedure from another library to download the file at the same time TWebBrowser is downloading it. This way, you can get the file without it being modified by TWebBrowser.

I wrote a test application, and all I had to do to get the file contents was:

procedure TForm1.WebBrowserBeforeNavigate2(Sender: TObject;
  const pDisp: IDispatch; var URL, Flags, TargetFrameName, PostData,
  Headers: OleVariant; var Cancel: WordBool);
begin
  HttpGetText(URL,Memo1.Lines);
end;

HttpGetText is a blocking function from the Synapse library (http://www.ararat.cz/synapse/doku.php/start).

You could also use ICS, Indy, or TDownLoadURL. Note that TDownLoadURL is not blocking, and I was never able to get its AfterDownload event to work.

crefird answered Sep 24 '22 18:09