 

Web page scraping in Delphi

Do you know of a library for web page scraping in Delphi, something like Beautiful Soup or Scrapy for Python?

philnext asked Feb 04 '13 17:02





2 Answers

Well, this is not for Delphi but for FreePascal, since I do not have a recent Delphi version; porting between the two is not supposed to be difficult.

Anyway, my Internet Tools are probably the best Pascal web scraping library out there.

You can, e.g. print all links on a page with:

uses simpleinternet, xquery;

var a: IXQValue;
begin
  for a in process('http://stackoverflow.com', '//a/@href') do
    writeln(a.toString);
end.

They are platform-independent; fully support XPath 2, XQuery, CSS 3 selectors (those are not as well tested, though; XPath is better anyway) and pattern matching; parse XML and HTML; and download over HTTP and HTTPS.

BeniBela answered Oct 27 '22 00:10



After the page has loaded in a TWebBrowser component, query the TWebBrowser.Document property for the IHTMLDocument2 interface; you can then enumerate the elements.

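You can use getElementById, getElementsByTagName or getElementsByName (all declared on IHTMLDocument3), for example:

var
  Elem: IHTMLElement;
begin
  // getElementById is a method of IHTMLDocument3 (unit MSHTML)
  Elem := (WebBrowser1.Document as IHTMLDocument3).getElementById('myid');
end;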

Or get the whole HTML text and process it any way you want, for example:

// Doc: IHTMLDocument2; SourceHTML: string
Doc := WebBrowser1.Document as IHTMLDocument2;
SourceHTML := Doc.body.innerHTML;  // markup inside <body>
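
Enumerating elements, as mentioned at the start of this answer, could look roughly like the sketch below. It is only an illustration, not tested against a particular Delphi version: ListLinks and its Lines parameter are hypothetical names introduced here, and the procedure assumes it is called after the page has finished loading (for example from the OnDocumentComplete event).

```pascal
// Requires units: SysUtils, Classes, Variants, SHDocVw, MSHTML

// Hypothetical helper: adds the href of every link in the loaded page to Lines.
procedure ListLinks(WebBrowser: TWebBrowser; Lines: TStrings);
var
  Doc: IHTMLDocument2;
  Links: IHTMLElementCollection;
  Elem: IHTMLElement;
  I: Integer;
begin
  // Document is nil until a page has been loaded
  if not Supports(WebBrowser.Document, IHTMLDocument2, Doc) then
    Exit;
  Links := Doc.links;  // collection of <a> and <area> elements that have an href
  for I := 0 to Links.length - 1 do
  begin
    Elem := Links.item(I, 0) as IHTMLElement;
    // VarToStr turns a Null attribute into an empty string instead of raising
    Lines.Add(VarToStr(Elem.getAttribute('href', 0)));
  end;
end;
```

Called as ListLinks(WebBrowser1, Memo1.Lines), this would fill a memo with the page's link targets; Doc.all can be walked the same way to visit every element.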
Leonardo Gregianin answered Oct 27 '22 00:10
