
Parse HTTP directory listing

Good day! I'm using Delphi XE and Indy TIdHTTP. Using the Get method I retrieve a remote directory listing, and I need to parse it: get the list of files with their sizes and timestamps, and distinguish files from subdirectories. Is there a good routine to do that? Thank you in advance! Vojtech

Here is the sample:

<head>
  <title>127.0.0.1 - /</title>
</head>
<body>
  <H1>127.0.0.1 - /</H1><hr>
<pre>      
  Mittwoch, 30. März 2011    12:01        &lt;dir&gt; <A HREF="/SubDir/">SubDir</A><br />
  Mittwoch, 9. Februar 2005    17:14          113 <A HREF="/file.txt">file.txt</A><br />
</pre>
<hr>
</body>
Vojtech asked Feb 23 '12 13:02


2 Answers

Given the code sample, I guess the fastest way to parse it would be like this:

  • Identify the <pre>...</pre> block containing all the listing lines. Should be easy.
  • Put everything between the <pre> and </pre> into a TStringList. Each line is a file or folder, and the format is very simple.
  • Extract the links from each line, extract the date, time and size if you need it. Best done with a regex (you've got Delphi XE so you've got built-in Regex).
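
The three steps above can be sketched roughly as follows. This is only an illustration, assuming the listing always has the shape shown in the question; `TListingEntry`, `ExtractPreBlock` and `TryParseListingLine` are hypothetical names, and the pattern is one possible regex rather than a tested general solution. The timestamp is kept as raw text because converting localized dates such as "Mittwoch, 30. März 2011" depends on the server's locale.

```delphi
program ListingRegexSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes,
  RegularExpressions; // named System.RegularExpressions from XE2 onwards

type
  // Hypothetical record for one listing row, not part of any library
  TListingEntry = record
    Stamp: string;   // raw date/time text, left unconverted
    IsDir: Boolean;
    Size: Int64;     // 0 for directories
    Href: string;
    Name: string;
  end;

// Step 1: return the text between <pre> and </pre>, or '' if there is none.
function ExtractPreBlock(const Html: string): string;
var
  M: TMatch;
begin
  M := TRegEx.Match(Html, '<pre[^>]*>(.*?)</pre>', [roIgnoreCase, roSingleLine]);
  if M.Success then
    Result := M.Groups[1].Value
  else
    Result := '';
end;

// Step 3: split one row into timestamp, size or <dir> marker, link and name.
function TryParseListingLine(const Line: string; out Entry: TListingEntry): Boolean;
const
  // Assumed row layout: timestamp, then "&lt;dir&gt;" or a size, then the link
  Pattern = '^\s*(.+?)\s+(&lt;dir&gt;|\d+)\s+<A HREF="([^"]*)">([^<]*)</A>';
var
  M: TMatch;
begin
  M := TRegEx.Match(Line, Pattern, [roIgnoreCase]);
  Result := M.Success;
  if not Result then
    Exit;
  Entry.Stamp := M.Groups[1].Value;
  Entry.IsDir := SameText(M.Groups[2].Value, '&lt;dir&gt;');
  if Entry.IsDir then
    Entry.Size := 0
  else
    Entry.Size := StrToInt64(M.Groups[2].Value);
  Entry.Href := M.Groups[3].Value;
  Entry.Name := M.Groups[4].Value;
end;

const
  // The <pre> block from the question; #$00E4 is the "a umlaut" in "Maerz"
  SampleHtml =
    '<pre>'#13#10 +
    '  Mittwoch, 30. M'#$00E4'rz 2011    12:01        &lt;dir&gt; <A HREF="/SubDir/">SubDir</A><br />'#13#10 +
    '  Mittwoch, 9. Februar 2005    17:14          113 <A HREF="/file.txt">file.txt</A><br />'#13#10 +
    '</pre>';

var
  Lines: TStringList;
  Entry: TListingEntry;
  I: Integer;
begin
  Lines := TStringList.Create;
  try
    // Step 2: one listing row per string list item
    Lines.Text := ExtractPreBlock(SampleHtml);
    for I := 0 to Lines.Count - 1 do
      if TryParseListingLine(Lines[I], Entry) then
        Writeln(Entry.Href, ' name=', Entry.Name, ' dir=', Entry.IsDir,
          ' size=', Entry.Size);
  finally
    Lines.Free;
  end;
end.
```

Rows that do not match the pattern (for example the blank line right after `<pre>`) are simply skipped. A different server or locale may format its listing differently, in which case the pattern would need adjusting.
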
Cosmin Prund answered Oct 13 '22 18:10


This should give you a good starting point, using the DOM:

uses
  MSHTML,
  ActiveX,
  ComObj,
  Variants; // for VarArrayCreate

procedure DocumentFromString(Document: IHTMLDocument2; const S: WideString);
var
  v: OleVariant;
begin
  v := VarArrayCreate([0, 0], varVariant);
  v[0] := S;
  Document.Write(PSafeArray(TVarData(v).VArray));
  Document.Close;
end;

function StripMultipleChar(const S: string; const C: Char): string;
begin
  Result := S;
  while Pos(C + C, Result) <> 0 do
    Result := StringReplace(Result, C + C, C, [rfReplaceAll]);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  Document: IHTMLDocument2;
  Elements: IHTMLElementCollection;
  Element: IHTMLElement;
  I: Integer;
  Line: string;
begin
  Document := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
  DocumentFromString(Document, '<head>...'); // your HTML here

  Elements := Document.all.tags('A') as IHTMLElementCollection;
  for I := 0 to Elements.length - 1 do
  begin
    Element := Elements.item(I, '') as IHTMLElement;
    Memo1.Lines.Add('A HREF=' + Element.getAttribute('HREF', 2));
    Memo1.Lines.Add('A innerText=' + Element.innerText);

    // Get the text that sits immediately before the element
    Line := (Element as IHTMLElement2).getAdjacentText('beforeBegin');

    // Line => "Mittwoch, 30. März 2011 12:01 <dir>" OR:
    // Line => "Mittwoch, 9. Februar 2005 17:14 113"...
    // I don't know what the actual delimiter is:
    // it could be [space] or [tab], so we need to normalize the Line.
    // If it is tabs, parsing is easier, because the timestamps also contain spaces

    Line := Trim(Line);
    Line := StripMultipleChar(Line, #32); // collapse runs of spaces
    Line := StripMultipleChar(Line, #9);  // collapse runs of tabs

    // TODO: ParseLine (from right to left)

    Memo1.Lines.Add(Line);
    Memo1.Lines.Add('-------------');
  end;
end;

Output:

A HREF=/SubDir/
A innerText=SubDir
Mittwoch, 30. März 2011 12:01 <dir>
-------------
A HREF=/file.txt
A innerText=file.txt
Mittwoch, 9. Februar 2005 17:14 113
-------------
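
The `ParseLine` step left as a TODO above could look something like the sketch below. It assumes the normalized `Line` has the form shown in the output (date text, time, then either `<dir>` or a size, separated by single spaces or tabs). The name `ParseLine` comes from the TODO comment, but its signature here is made up for illustration. Working from right to left means the spaces inside the date do not matter.

```delphi
program ParseLineSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical signature: splits a normalized listing line from right to left.
procedure ParseLine(const Line: string; out DateText, TimeText: string;
  out IsDir: Boolean; out Size: Int64);
const
  Delims = ' '#9; // space or tab
var
  Rest, SizeText: string;
  P: Integer;
begin
  Rest := Trim(Line);

  // Rightmost token: '<dir>' or the file size
  P := LastDelimiter(Delims, Rest);
  SizeText := Copy(Rest, P + 1, MaxInt);
  Rest := TrimRight(Copy(Rest, 1, P - 1));

  // Next token from the right: the time
  P := LastDelimiter(Delims, Rest);
  TimeText := Copy(Rest, P + 1, MaxInt);

  // Whatever remains is the date, which itself contains spaces
  DateText := TrimRight(Copy(Rest, 1, P - 1));

  IsDir := SameText(SizeText, '<dir>');
  if IsDir then
    Size := 0
  else
    Size := StrToInt64Def(SizeText, -1); // -1 marks an unparsable size
end;

var
  DateText, TimeText: string;
  IsDir: Boolean;
  Size: Int64;
begin
  ParseLine('Mittwoch, 9. Februar 2005 17:14 113', DateText, TimeText, IsDir, Size);
  // prints: Mittwoch, 9. Februar 2005 | 17:14 | FALSE | 113
  Writeln(DateText, ' | ', TimeText, ' | ', IsDir, ' | ', Size);
end.
```

Turning `DateText` and `TimeText` into a `TDateTime` is deliberately left out, since the day and month names depend on the server's locale.
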

EDIT:
I have changed the StripMultipleChar implementation to a simpler one, although I believe the former version was better optimized for speed. Since the lines are very short, there will not be much difference in performance.
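
For reference, a single-pass variant might look like the sketch below. It is not the former version mentioned above (that code is not shown here), just one way to collapse runs without repeated `StringReplace` calls; `CollapseRuns` is a hypothetical name.

```delphi
program CollapseRunsSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Copies S once, skipping every C that directly follows another C.
function CollapseRuns(const S: string; const C: Char): string;
var
  I, Len: Integer;
begin
  SetLength(Result, Length(S));
  Len := 0;
  for I := 1 to Length(S) do
    // Relies on Delphi's default short-circuit evaluation: S[I - 1] is
    // only read when I > 1
    if (S[I] <> C) or (I = 1) or (S[I - 1] <> C) then
    begin
      Inc(Len);
      Result[Len] := S[I];
    end;
  SetLength(Result, Len);
end;

begin
  // prints: a b c
  Writeln(CollapseRuns('a    b  c', ' '));
end.
```
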

kobik answered Oct 13 '22 20:10