
parsing HTML on the iPhone [closed]

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.

Does such a library exist, or am I better off just trying to use regular expressions?

Sophie Alpert asked Jan 02 '09 00:01




1 Answer

I found hpple quite useful for parsing messy HTML. Hpple is an Objective-C wrapper around the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the result.

Requirements:

-Add libxml2 includes to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Header Search Paths"
  3. Add a new search path "${SDKROOT}/usr/include/libxml2"
  4. Enable recursive option

-Add libxml2 library to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Other Linker Flags"
  3. Add a new linker flag "-lxml2"

-From hpple get the following source code files and add them to your project:

  1. TFHpple.h
  2. TFHpple.m
  3. TFHppleElement.h
  4. TFHppleElement.m
  5. XPathQuery.h
  6. XPathQuery.m

-Work through the W3Schools XPath Tutorial to get comfortable with the XPath language.

Code Example

    #import "TFHpple.h"

    NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];

    // Create parser
    TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];

    // Get all the cells of the 2nd row of the 3rd table
    NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"];

    // Access the first cell
    TFHppleElement *element = [elements objectAtIndex:0];

    // Get the text within the cell tag
    NSString *content = [element content];

    [xpathParser release];
    [data release];

Known issues

Because hpple is a wrapper over XPathQuery, which is itself another wrapper, this option is probably not the most efficient. If performance is an issue in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery library code.

Albaregar answered Sep 25 '22 17:09