I need to find elements in an HTML string. Unfortunately the HTML is pretty much broken (e.g. closing tags without an opening pair).
I tried to use XPath with launchpad.net/xmlpath, but it can't parse HTML that is this badly malformed.
How can I find elements in broken HTML with Go? I would prefer using XPath, but I am open to other solutions too, as long as I can use them to look for tags with a specific id or class.
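It seems golang.org/x/net/html does the job: it parses the broken markup leniently, and rendering the parsed tree back out gives well-formed HTML that xmlpath can handle.
This is what I am doing now: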
package main

import (
	"bytes"
	"log"
	"strings"

	"golang.org/x/net/html"
	"gopkg.in/xmlpath.v2"
)

func main() {
	// Note the unclosed <p>: the markup is not well-formed.
	brokenHTML := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`

	// Step 1: parse leniently with net/html, which repairs the tree.
	root, err := html.Parse(strings.NewReader(brokenHTML))
	if err != nil {
		log.Fatal(err)
	}

	// Step 2: render the repaired tree back to well-formed HTML.
	var b bytes.Buffer
	if err := html.Render(&b, root); err != nil {
		log.Fatal(err)
	}

	// Step 3: parse the fixed HTML with xmlpath and run the XPath query.
	xmlroot, err := xmlpath.ParseHTML(&b)
	if err != nil {
		log.Fatal(err)
	}

	path := xmlpath.MustCompile(`//h1[@id='someid']`)
	if value, ok := path.String(xmlroot); ok {
		log.Println("Found:", value)
	} else {
		log.Println("No match")
	}
}