Best way to parse HTML in Qt?

Tags: c++, html, qt

How would I extract the "href" attribute of every "a" tag on a page full of BAD HTML, in Qt?

y2k asked Feb 01 '10 19:02


2 Answers

I would use the built-in QtWebKit module. I don't know how it performs, but its parser should cope with "bad" HTML. Something like:

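#include <QObject>
#include <QUrl>
#include <QWebElement>
#include <QWebFrame>
#include <QWebPage>
#include <QWebView>

class MyPageLoader : public QObject
{
  Q_OBJECT

public:
  MyPageLoader();
  ~MyPageLoader();
  void loadPage(const QUrl&);

public slots:
  void replyFinished(bool);

private:
  QWebView* m_view;
};

MyPageLoader::MyPageLoader()
{
  // QWebView is a widget, so a QApplication must exist before this runs.
  m_view = new QWebView();

  connect(m_view, SIGNAL(loadFinished(bool)),
          this, SLOT(replyFinished(bool)));
}

MyPageLoader::~MyPageLoader()
{
  delete m_view;  // the view has no parent, so release it here
}

void MyPageLoader::loadPage(const QUrl& url)
{
  m_view->load(url);
}

void MyPageLoader::replyFinished(bool ok)
{
  if (!ok)
    return;  // the page failed to load

  // The DOM is built by WebKit's parser, so malformed markup is already handled here.
  QWebElementCollection elements = m_view->page()->mainFrame()->findAllElements("a");

  foreach (QWebElement e, elements) {
    const QString href = e.attribute("href");
    // Process href
  }
}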

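To use the class:

MyPageLoader loader;
loader.loadPage(QUrl("http://www.example.com"));

Loading is asynchronous: replyFinished() is called once the event loop is running and the page has finished loading. Then do whatever you like with the collection.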
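For comparison, here is a rough sketch of pulling the same "href" values out of sloppy markup without QtWebKit, using only the C++ standard library. This is a swapped-in technique, not what the answer above does: it is a hand-written tolerant scanner, not an HTML parser, and the helper name extractHrefs is hypothetical. It copes with mixed case, missing quotes and unclosed tags, but it ignores comments, script contents and entities, so treat it as a fallback under those assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt): collect the href values of <a ...> tags.
// Assumption: a simple attribute scan is good enough for the input at hand.
std::vector<std::string> extractHrefs(const std::string& html)
{
    std::vector<std::string> hrefs;
    const std::size_t n = html.size();
    auto lower = [](char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    };
    auto isSpace = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    std::size_t i = 0;
    while (i < n) {
        // An anchor tag starts with '<', then 'a' or 'A', then whitespace.
        if (html[i] != '<' || i + 2 >= n ||
            lower(html[i + 1]) != 'a' || !isSpace(html[i + 2])) {
            ++i;
            continue;
        }
        i += 2;

        // Walk the attributes until the tag closes or another tag begins
        // (the latter covers tags whose '>' is missing).
        while (i < n && html[i] != '>' && html[i] != '<') {
            while (i < n && (isSpace(html[i]) || html[i] == '/')) ++i;

            // Attribute name, lower-cased for comparison.
            std::string name;
            while (i < n && !isSpace(html[i]) && html[i] != '=' &&
                   html[i] != '>' && html[i] != '<' && html[i] != '/') {
                name += lower(html[i++]);
            }
            while (i < n && isSpace(html[i])) ++i;
            if (i >= n || html[i] != '=') continue;  // attribute without a value

            ++i;  // skip '='
            while (i < n && isSpace(html[i])) ++i;

            // Attribute value: quoted with ' or ", or unquoted.
            std::string value;
            if (i < n && (html[i] == '"' || html[i] == '\'')) {
                const char quote = html[i++];
                while (i < n && html[i] != quote) value += html[i++];
                if (i < n) ++i;  // skip the closing quote
            } else {
                while (i < n && !isSpace(html[i]) && html[i] != '>') value += html[i++];
            }

            if (name == "href") hrefs.push_back(value);
        }
    }
    return hrefs;
}
```

Calling extractHrefs() on a page's source returns the raw attribute values in document order; relative links are not resolved against the page URL, which the QtWebKit route can do for you through QUrl.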

Jaro answered Oct 11 '22 17:10



This question is already quite old, but I hope this will still help someone:

I wrote two small classes for Qt and published them on SourceForge. They let you access an HTML file in much the same way you are used to with XML.

You'll find the project here:
http://sourceforge.net/projects/sgml-for-qt/
The project wiki contains the documentation.

Drewle

drewle answered Oct 11 '22 17:10