
Jsoup-like HTML parser for C++ [closed]

I have been writing Java code to extract data from some web pages, and Jsoup was one of the best libraries to work with. Unfortunately, I now have to port the whole code to C/C++, and I cannot find a decent HTML parser for C++. Is there a Jsoup-like library for C++, or how can similar results be achieved?

[Currently I am using Curl to get the source of the pages and searching the internet for an HTML parser.]

Writwick asked Jul 29 '13 10:07





2 Answers

Unfortunately, I guess there is no parser like Jsoup for C++ ...

Besides the libraries already mentioned here, there is a good overview of C++ (and some C) parsers: Free C or C++ XML Parser Libraries

I used TinyXML-2 for (HTML) DOM parsing; it is a very small library (only 2 files) that runs on most operating systems, including non-desktop ones. Being an XML parser, it expects well-formed markup such as XHTML.

LibXml

  • push and pull parser (DOM, SAX)
  • Validation
  • XPath and XPointer support
  • Cross-platform / good documentation

Apache Xerces

  • push and pull parser (DOM, SAX)
  • Validation
  • No XPath support (but there may be a separate package for this?)
  • Cross-platform / good documentation
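For anyone unsure what the "push" (SAX) style in the two lists above means in practice: the parser walks the input once and calls your handlers as it meets tags, instead of building a tree first. Below is a toy sketch of that idea using only the standard library. `TagHandler`, `scanTags` and `collectStartTags` are hypothetical names made up for this illustration; they are not the LibXml or Xerces API, and the scanner ignores attributes, comments, entities and malformed markup.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback set, loosely modelled on SAX-style handlers.
struct TagHandler {
    std::function<void(const std::string&)> onStart;  // called for <name ...>
    std::function<void(const std::string&)> onEnd;    // called for </name>
    std::function<void(const std::string&)> onText;   // called for text between tags
};

// Toy push parser: scans the input once and reports events as they occur.
// Assumption: no '>' inside attribute values, no comments, no CDATA.
void scanTags(const std::string& input, const TagHandler& handler) {
    std::size_t pos = 0;
    while (pos < input.size()) {
        if (input[pos] != '<') {
            // Everything up to the next '<' is reported as one text event.
            std::size_t next = input.find('<', pos);
            if (next == std::string::npos) next = input.size();
            if (handler.onText) handler.onText(input.substr(pos, next - pos));
            pos = next;
            continue;
        }
        std::size_t close = input.find('>', pos);
        if (close == std::string::npos) break;  // unterminated tag: stop scanning
        std::string body = input.substr(pos + 1, close - pos - 1);
        bool isEnd = !body.empty() && body[0] == '/';
        if (isEnd) body.erase(0, 1);
        // The tag name is the leading run up to whitespace or '/'.
        std::size_t nameLen = 0;
        while (nameLen < body.size() &&
               !std::isspace(static_cast<unsigned char>(body[nameLen])) &&
               body[nameLen] != '/') {
            ++nameLen;
        }
        std::string name = body.substr(0, nameLen);
        if (isEnd) {
            if (handler.onEnd) handler.onEnd(name);
        } else {
            // Self-closing tags such as <br/> only produce a start event here.
            if (handler.onStart) handler.onStart(name);
        }
        pos = close + 1;
    }
}

// Usage helper: collect the names of all start tags in document order.
std::vector<std::string> collectStartTags(const std::string& input) {
    std::vector<std::string> names;
    TagHandler handler;
    handler.onStart = [&names](const std::string& n) { names.push_back(n); };
    scanTags(input, handler);
    return names;
}
```

A DOM parser would instead return a tree that you can query after parsing has finished. The push style needs less memory, but your handlers have to keep track of context themselves.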

If you are using C++/CLI, check out NSoup, a Jsoup port for .NET.

Some more:

  • htmlcxx - HTML and CSS APIs for C++
  • MSHTML (?)
  • pugixml (DOM / XPath and Unicode support)
  • LibCSS (CSS Parser) / LibDOM (DOM) (however, both in C)
  • hcxselect (CSS selector engine for C++)

Maybe you can combine a DOM parser with a CSS selector engine?
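To make that combination concrete, here is a rough sketch of how a selector layer can sit on top of a DOM tree. Everything in it is hypothetical and standard-library only: `Node` stands in for whatever node type your DOM parser provides, and `parseSelector`, `matches`, `selectAll` and `addChild` are illustrative names, not functions from hcxselect or any other library listed above. It only understands a single compound selector of the form `tag`, `#id`, `.class` or a mix such as `a.link` (at most one class, no descendant combinators).

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; a real parser would supply its own type.
struct Node {
    std::string tag;
    std::string id;
    std::vector<std::string> classes;
    std::vector<std::unique_ptr<Node>> children;
};

// One compound selector: optional tag, optional #id, optional .class.
struct SimpleSelector {
    std::string tag, id, cls;
};

// Split e.g. "a.link" or "div#main" into its parts.
// Limitation: only one class is supported per selector.
SimpleSelector parseSelector(const std::string& text) {
    SimpleSelector sel;
    std::string* target = &sel.tag;
    for (char c : text) {
        if (c == '#') target = &sel.id;
        else if (c == '.') target = &sel.cls;
        else target->push_back(c);
    }
    return sel;
}

// A node matches when every non-empty part of the selector agrees with it.
bool matches(const Node& node, const SimpleSelector& sel) {
    if (!sel.tag.empty() && sel.tag != node.tag) return false;
    if (!sel.id.empty() && sel.id != node.id) return false;
    if (!sel.cls.empty() &&
        std::find(node.classes.begin(), node.classes.end(), sel.cls) == node.classes.end()) {
        return false;
    }
    return true;
}

// Depth-first walk that collects matching nodes in document order.
void selectInto(const Node& node, const SimpleSelector& sel, std::vector<const Node*>& out) {
    if (matches(node, sel)) out.push_back(&node);
    for (const auto& child : node.children) selectInto(*child, sel, out);
}

std::vector<const Node*> selectAll(const Node& root, const std::string& selector) {
    std::vector<const Node*> out;
    selectInto(root, parseSelector(selector), out);
    return out;
}

// Convenience for building a tree by hand; returns a reference to the new child.
Node& addChild(Node& parent, const std::string& tag, const std::string& id = "",
               const std::vector<std::string>& classes = {}) {
    std::unique_ptr<Node> child(new Node());
    child->tag = tag;
    child->id = id;
    child->classes = classes;
    parent.children.push_back(std::move(child));
    return *parent.children.back();
}
```

With a tree built through `addChild`, a call such as `selectAll(root, "a.link")` returns pointers to every `a` element carrying the class `link`, which is roughly what `doc.select("a.link")` gives you in Jsoup. In real code the `Node` side would come from one of the parsers above, and only the matching logic would be yours or the selector library's.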

ollo answered Oct 12 '22 12:10



If you are familiar with the Qt framework, the most convenient way is to use QWebElement (Reference here).

Otherwise (as another post suggests), using Tidy to convert the HTML to valid XML and then parsing it with an XML parser such as libxml++ is a good option. You can find sample code showing these two steps here.

sgun answered Oct 12 '22 12:10
