I have been writing some codes to get some data from some pages in Java and Jsoup was on of the best libraries to work with. But, Unfortunately I have to port the whole code to C/C++. But I a cannot find any decent html parser to use on c++. Is there any Jsoup like library for C++ or How can similar results be achieved?
[Currently I am using Curl to get the source of the pages and roaming the internet to find a html parser]
In this article, I will focus on one of my favorites, jsoup, which was first released as open source in January 2010. It has been under active development since then by Jonathan Hedley, and the code uses the liberal MIT license.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
You can extract data by using CSS selectors, or by navigating and modifying the Document Object Model directly - just like a browser does, except you do it in Java code. You can also modify and write HTML out safely too. jsoup will not run JavaScript for you - if you need that in your app I'd recommend looking at JCEF.
Its party trick is a CSS selector syntax to find elements, e.g.: String html = "<html><head><title>First parse</title></head>" + "<body><p>Parsed HTML into a doc. </p></body></html>"; Document doc = Jsoup. parse(html); Elements links = doc.
Unfortunately, i guess there's no parser like Jsoup for C++ ...
Beside the libraries which are already mentioned here, there's a good overview about C++ (some C too) parser here: Free C or C++ XML Parser Libraries
For parsing i used TinyXML-2 for (Html-) DOM parsing; it's a very small (only 2 files) library that runs on most OS (even non-desktop).
LibXml
Apache Xerxces
If you are on C++ CLI, check out NSoup - a Jsoup port for .NET.
Some more:
Maybe you can combine a DOM Model / Parser and a CSS selector together?
If you are familiar with Qt Framework the most convenient way is using QWebElement (Reference here).
Otherwise, (as another post suggests) using Tidy to convert HTML to a valid XML and then using an XML parser such as libxml++ is a good option. You can find a sample code showing these two steps here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With