Go Parse HTML table

I have an HTML table that I would like to parse, something like the one at http://sprunge.us/IJUC. However, I'm not sure of a good way to extract the information. I've looked at a couple of HTML parsers, but they seem to require that everything you want to grab is wrapped in its own special tag, whereas the majority of my data sits in plain <td></td> cells.

Does anyone have a suggestion for parsing this information out?

Joe P. asked Oct 14 '12 14:10



1 Answer

Shameless plug: My goquery library. It's the jQuery syntax brought to Go (requires Go's experimental html package, see instructions in the README of the library).

So you can do things like this (assuming your HTML document is loaded in doc, a *goquery.Document):

// Print the text content of every <td> cell in the document.
doc.Find("td").Each(func(i int, s *goquery.Selection) {
	fmt.Printf("Content of cell %d: %s\n", i, s.Text())
})

Edit: Changed doc.Root.Find to doc.Find in the example, since a goquery Document is now a Selection too (new in the v0.2/master branch).

mna answered Sep 18 '22 13:09