Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.
Does such a library exist, or am I better off just trying to use regular expressions?
Parsing means analyzing and converting a program into an internal format that a runtime environment can actually run, for example the JavaScript engine inside browsers. The browser parses HTML into a DOM tree. HTML parsing involves tokenization and tree construction.
Html5lib. html5lib is a pure-python library for parsing HTML.
I found using hpple quite useful to parse messy HTML. Hpple project is a Objective-C wrapper on the XPathQuery library for parsing HTML. Using it you can send an XPath query and receive the result .
Requirements:
-Add libxml2 includes to your project
-Add libxml2 library to to your project
-From hpple get the following source code files an add them to your project:
-Take a walk on w3school XPath Tutorial to feel comfortable with the XPath language.
Code Example
#import "TFHpple.h" NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"]; // Create parser xpathParser = [[TFHpple alloc] initWithHTMLData:data]; //Get all the cells of the 2nd row of the 3rd table NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"]; // Access the first cell TFHppleElement *element = [elements objectAtIndex:0]; // Get the text within the cell tag NSString *content = [element content]; [xpathParser release]; [data release];
Known issues
As hpple is a wrapper over XPathQuery which is another wrapper, this option probably is not the most efficient. If performance is an issue in your project, I recommend to code your own lightweight solution based on hpple and xpathquery library code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With