Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest, easiest, and best way to parse an HTML table?

I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, python or JavaScript.

This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.

BeautifulSoup is the first thing that comes to mind. Another possibility is copying/pasting it in TextMate and then running regular expressions.

What do you suggest?

This is the script that I ended up writing, but as I said, I'm looking for a more general solution.

from BeautifulSoup import BeautifulSoup
import urllib2


url = 'http://www.datamystic.com/timezone/time_zones.html';
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if(len(tds)==4):
        countrycode = tds[1].string
        timezone = tds[2].string
        if(type(countrycode) is not type(None) and type(timezone) is not type(None)):
            print "\'%s\' => \'%s\'," % (countrycode.strip(), timezone.strip())

Comments and suggestions for improvement to my python code welcome, too ;)

like image 280
Zack Burt Avatar asked Feb 04 '11 00:02

Zack Burt


People also ask

How do I extract a table in HTML?

To extract a table from HTML, you first need to open your developer tools to see how the HTML looks and verify if it really is a table and not some other element. You open developer tools with the F12 key, see the “Elements” tab, and highlight the element you're interested in.

Can you parse HTML?

HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser parses tokenized input into the document, building up the document tree.


2 Answers

Avoid regular expressions for parsing HTML, they're simply not appropriate for it, you want a DOM parser like BeautifulSoup for sure...

A few other alternatives

  • SimpleHTMLDom PHP
  • Hpricot & Nokogiri Ruby
  • Web::Scraper Perl/CPAN

All of these are reasonably tolerant of poorly formed HTML.

like image 42
ocodo Avatar answered Sep 24 '22 04:09

ocodo


For your general problem: try lxml.html from the lxml package (think of it as the stdlibs xml.etree on steroids: the same xml api, but with html support, xpath, xslt etc...)

A quick example for your specific case:

from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
           [td.text_content().strip() for td in row.findall('td')] 
           for row in table.findall('tr')
       ]

This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)

BUT: More specifically for your particular use case: there are better way to get at timezone database information than scraping that particular webpage (aside: note that the web page actually mentions that you are not allowed to copy its contents). There are even existing libraries that already use this information, see for example python-dateutil.

like image 174
Steven Avatar answered Sep 25 '22 04:09

Steven