
Best way to extract info from the web in Delphi

I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g. extracting a movie rating from 'imdb.com'.

I'm currently using the IndyHttp components to get the page and StrUtils to parse the text, but the content is limited.

Gab asked Jan 13 '12 00:01


5 Answers

I have found plain regexes to be highly intuitive and simple when dealing with good websites, and IMDB is a good website.

For example, the movie rating on IMDB's movie page is in a <DIV> with class="star-box-giga-star". That's very easy to extract with a regular expression. The following regular expression extracts the movie rating from the raw HTML into capture group 1:

star-box-giga-star[^>]*>([^<]*)<

It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then for the > that closes the DIV's opening tag, and then captures everything up to the following <. To create a new regex like this, use a web browser that allows inspecting elements (for example Chrome or Opera). In Chrome you can simply right-click the element you want to capture, choose Inspect element, and look around for easily identifiable markers that can anchor a good regex. In this case the "star-box-giga-star" class is obviously easy to identify. You'll usually have no problem finding such markers on good websites, because good websites use CSS, and CSS needs IDs or classes to style the elements properly.
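For reference, here is a minimal sketch of applying that regex from Delphi. It assumes Delphi XE2 or later, where TRegEx lives in System.RegularExpressions (in XE the units are named RegularExpressions and SysUtils). The ExtractRating helper and the sample HTML fragment are made up for illustration; the real page markup may differ and would normally come from your HTTP component.

```delphi
program ExtractRatingDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: returns the text captured by group 1,
// or 'N/A' when the pattern does not match.
function ExtractRating(const Html: string): string;
var
  M: TMatch;
begin
  M := TRegEx.Match(Html, 'star-box-giga-star[^>]*>([^<]*)<');
  if M.Success then
    Result := Trim(M.Groups[1].Value) // group 0 is the whole match
  else
    Result := 'N/A';
end;

const
  // Made-up fragment for illustration only, not copied from the real page.
  SampleHtml = '<div class="titlePageSprite star-box-giga-star"> 8.7 </div>';
begin
  Writeln(ExtractRating(SampleHtml));
end.
```

Returning 'N/A' on a failed match keeps the caller simple when the site changes its markup and the class name disappears.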

Cosmin Prund answered Nov 18 '22 06:11


Processing an RSS feed is more convenient.

As of the time of posting, the only RSS feeds available on the site are:

  • Born on this Date
  • Died on this Date
  • Daily Poll

You can, however, request a new feed by getting in touch with the help desk.

Resources on RSS feed processing:

  • Relevant post here on SO.
  • Super Object
  • Wikipedia.
menjaraz answered Nov 18 '22 06:11


When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.

Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the website to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (Denial of Service and Intellectual Property).

Here's IMDB's statement:

You may not use data mining, robots, screen scraping, or similar online data gathering and extraction tools on our website.

To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.

Marcus Adams answered Nov 18 '22 06:11


Use HTML Tidy to convert the HTML to valid XML, then use an XML parser on the result, either with XPath or with your own code (which is what I do).
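As a rough sketch of the XPath half of this approach, the program below queries an already-tidied fragment through late-bound MSXML 6. Several things here are assumptions, not part of the answer: it is Windows-only, it uses the XE2+ unit names, the SelectText helper is hypothetical, and the sample input is made up and deliberately has no DOCTYPE or XHTML namespace (real Tidy output usually has both and would need extra handling). The Tidy step itself is not shown.

```delphi
program XPathRatingDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Variants, System.Win.ComObj, Winapi.ActiveX;

// Hypothetical helper: returns the trimmed text of the first node
// matching XPathExpr, or 'N/A' if the XML does not load or nothing matches.
function SelectText(const Xml, XPathExpr: string): string;
var
  Doc, Node: OleVariant;
  Loaded: Boolean;
  S: string;
begin
  Result := 'N/A';
  Doc := CreateOleObject('Msxml2.DOMDocument.6.0');
  Doc.async := False;
  Loaded := Doc.loadXML(Xml);
  if not Loaded then
    Exit; // not well-formed, so the HTML still needs to go through Tidy
  Node := Doc.selectSingleNode(XPathExpr);
  if not VarIsClear(Node) then
  begin
    S := Node.text;
    Result := Trim(S);
  end;
end;

const
  // Made-up, already well-formed input standing in for Tidy's output.
  SampleXml =
    '<html><body><div class="star-box-giga-star"> 8.7 </div></body></html>';
begin
  CoInitialize(nil); // console programs must initialise COM themselves
  try
    Writeln(SelectText(SampleXml,
      '//div[contains(@class, ''star-box-giga-star'')]'));
  finally
    CoUninitialize;
  end;
end.
```

Compared with a regex, the XPath expression names the element and attribute explicitly, so it keeps working if attributes are reordered or extra whitespace appears in the tag.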

Misha answered Nov 18 '22 05:11


All the answers posted cover your general question well. I usually follow a strategy similar to the one detailed by Cosmin. I use WinInet and regex for most of my web extraction needs.

But let me add my two cents on the specific subquestion of extracting an IMDB rating. IMDBAPI.COM provides a query interface returning JSON, which is very handy for this type of search.

So a very simple command line program for getting an IMDB rating would be...

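program imdbrating;
{$APPTYPE CONSOLE}
uses htmlutils; // must supply HttpGet and UrlEncode

// Naive lookup of "Parm":"value" in a JSON string. It assumes the value is a
// quoted string with no comma inside and that another field follows it.
function ExtractJsonParm(const Parm, H: string): string;
var
  R, ValueStart: Integer;
begin
  R := Pos('"' + Parm + '":', H);
  if R <> 0 then
  begin
    // Skip the quoted key, the colon and the value's opening quote.
    ValueStart := R + Length(Parm) + 4;
    // The comma offset minus 2 drops the closing quote and the comma itself.
    Result := Copy(H, ValueStart, Pos(',', Copy(H, ValueStart, Length(H))) - 2);
  end
  else
    Result := 'N/A';
end;

var
  H: string;
begin
  // The movie title is taken from the first command line argument.
  H := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  Writeln(ExtractJsonParm('Rating', H));
end.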
PA. answered Nov 18 '22 07:11