
What is the best practice to parse HTML in Swift?

I'm a Swift newbie. I need something like Python's BeautifulSoup in a Swift iOS project. Specifically, I need to get the href of every <a> element whose href ends with ".txt". What steps should I take?

amazingbasil asked Jun 26 '15 19:06




1 Answer

There are several good libraries for parsing HTML from Swift and Objective-C, including the following:

  • hpple
  • NDHpple
  • Kanna (formerly Swift-HTML-Parser)
  • Fuzi
  • SwiftSoup
  • Ji

Here is an example for each of the libraries listed above, most of them using XPath queries:

hpple:

    let data = NSData(contentsOfFile: path)
    let doc = TFHpple(htmlData: data)

    if let elements = doc.searchWithXPathQuery("//a/@href[ends-with(.,'.txt')]") as? [TFHppleElement] {
        for element in elements {
            println(element.content)
        }
    }

NDHpple:

    let data = NSData(contentsOfFile: path)!
    let html = NSString(data: data, encoding: NSUTF8StringEncoding)!
    let doc = NDHpple(HTMLData: html)

    if let elements = doc.searchWithXPathQuery("//a/@href[ends-with(.,'.txt')]") {
        for element in elements {
            println(element.children?.first?.content)
        }
    }

Kanna (XPath and CSS selectors):

    let html = "<html><head></head><body><ul><li><input type='image' name='input1' value='string1value' class='abc' /></li><li><input type='image' name='input2' value='string2value' class='def' /></li></ul><span class='spantext'><b>Hello World 1</b></span><span class='spantext'><b>Hello World 2</b></span><a href='example.com'>example(English)</a><a href='example.co.jp'>example(JP)</a></body>"

    if let doc = Kanna.HTML(html: html, encoding: NSUTF8StringEncoding) {
        var bodyNode = doc.body

        if let inputNodes = bodyNode?.xpath("//a/@href[ends-with(.,'.txt')]") {
            for node in inputNodes {
                println(node.contents)
            }
        }
    }

Fuzi (XPath and CSS selectors):

    let html = "<html><head></head><body><ul><li><input type='image' name='input1' value='string1value' class='abc' /></li><li><input type='image' name='input2' value='string2value' class='def' /></li></ul><span class='spantext'><b>Hello World 1</b></span><span class='spantext'><b>Hello World 2</b></span><a href='example.com'>example(English)</a><a href='example.co.jp'>example(JP)</a></body>"

    do {
        // if encoding is omitted, it defaults to NSUTF8StringEncoding
        let doc = try HTMLDocument(string: html, encoding: NSUTF8StringEncoding)

        // XPath queries
        for anchor in doc.xpath("//a/@href[ends-with(.,'.txt')]") {
            print(anchor.stringValue)
        }
    } catch let error {
        print(error)
    }

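The ends-with function is part of XPath 2.0. The XPath-based libraries above wrap libxml2, which implements XPath 1.0 only, so if the query is rejected, the XPath 1.0 equivalent is //a/@href[substring(., string-length(.) - 3) = '.txt'].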
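If ends-with is not available in the XPath engine a given library uses, one fallback is to select every href with a plain //a/@href query and do the suffix check in Swift. The following is a minimal sketch, assuming the href values have already been collected into an array of strings by whichever parser you picked. The function name textFileLinks and the sample values are hypothetical, not part of any of the libraries above.

```swift
/// Keeps only the hrefs whose path part ends in ".txt".
/// `hrefs` is assumed to hold raw attribute values, e.g. the results
/// of a plain `//a/@href` query (hypothetical input, not a library API).
func textFileLinks(in hrefs: [String]) -> [String] {
    return hrefs.filter { href in
        // Ignore any query string or fragment before checking the suffix.
        let path = href.prefix(while: { $0 != "?" && $0 != "#" })
        // Compare case-insensitively so "README.TXT" also matches.
        return path.lowercased().hasSuffix(".txt")
    }
}

// Sample values standing in for hrefs extracted from a page.
let sample = ["notes.txt", "index.html", "docs/README.TXT", "data.txt?download=1", "txt"]
print(textFileLinks(in: sample))
```

Ignoring the query string and the letter case is a design choice in this sketch. For a strict match on the raw attribute value, replace the closure body with href.hasSuffix(".txt").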

SwiftSoup (CSS selectors):

    do {
        let doc: Document = try SwiftSoup.parse("...")
        let links: Elements = try doc.select("a[href]") // a with href
        let pngs: Elements = try doc.select("img[src$=.png]") // img with src ending .png
        let masthead: Element? = try doc.select("div.masthead").first() // div with class=masthead
        let resultLinks: Elements? = try doc.select("h3.r > a") // direct a after h3
    } catch Exception.Error(let type, let message) {
        print(message)
    } catch {
        print("error")
    }

Ji (XPath):

    let jiDoc = Ji(htmlURL: URL(string: "http://www.apple.com/support")!)
    let titleNode = jiDoc?.xPath("//head/title")?.first
    print("title: \(titleNode?.content)") // title: Optional("Official Apple Support")

I hope this helps you.

Victor Sigler answered Sep 26 '22 00:09