Golang parse HTML, extract all content with <body> </body> tags

Tags:

html

go

As stated in the title, I need to return all of the content within the body tags of an HTML document, including any nested HTML tags. I'm curious what the best way to go about this is. I had a working solution with the Gokogiri package, but I am trying to stay away from packages that depend on C libraries. Is there a way to accomplish this with the Go standard library, or with a package that is 100% Go?

Since posting my original question I have tried the following packages, neither of which solved the problem:

  • pkg/encoding/xml/ (standard library xml package)
  • golang.org/x/net/html

Neither seems to return child elements or nested tags from inside the body. For example, given this document:

<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content
        <p>more content</p>
    </body>
</html>

both return only "body content", ignoring the <p> tag that follows and the text it wraps.

The overall goal is to obtain a string that looks like:

<body>
    body content
    <p>more content</p>
</body>
user2737876 asked May 07 '15 18:05



1 Answer

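This can be solved by recursively finding the body node using the html package (golang.org/x/net/html), and then rendering the HTML starting from that node. Because html.Render serializes a node together with all of its descendants, the nested <p> element is included in the output.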

package main

import (
    "bytes"
    "errors"
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html"
)

// Body walks the node tree depth-first and returns the first <body> element.
func Body(doc *html.Node) (*html.Node, error) {
    var body *html.Node
    var crawler func(*html.Node)
    crawler = func(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == "body" {
            body = node
            return
        }
        // Stop visiting siblings as soon as the body has been found.
        for child := node.FirstChild; child != nil && body == nil; child = child.NextSibling {
            crawler(child)
        }
    }
    crawler(doc)
    if body != nil {
        return body, nil
    }
    return nil, errors.New("missing <body> in the node tree")
}

// renderNode serializes n and all of its descendants back to HTML.
func renderNode(n *html.Node) (string, error) {
    var buf bytes.Buffer
    if err := html.Render(&buf, n); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    doc, err := html.Parse(strings.NewReader(htm))
    if err != nil {
        log.Fatal(err)
    }
    bn, err := Body(doc)
    if err != nil {
        log.Fatal(err)
    }
    body, err := renderNode(bn)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(body)
}

const htm = `<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    body content
    <p>more content</p>
</body>
</html>`
Joachim Birche answered Sep 20 '22 06:09
