Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse HTML Table with Python BeautifulSoup

I am attempting to use BeautifulSoup to parse an html table which I uploaded to http://pastie.org/8070879 in order to get the three columns (0 to 735, 0.50 to 1.0 and 0.5 to 0.0) as lists. To explain why, I will want the integers 0-735 to be keys and the decimal numbers to be values.

From reading many of the other posts on SO, I have come up with the following which does not come close to creating the lists I want. All it does is display the text in the table as is seen here http://i1285.photobucket.com/albums/a592/TheNexulo/output_zps20c5afb8.png

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("fide.html"))
table = soup.find('table')

rows = table.findAll('tr')

for tr in rows:
  cols = tr.findAll('td')
  for td in cols:
     text = ''.join(td.find(text=True))
     print text + "|",
  print 

I'm new to Python and BeautifulSoup, so please be gentle with me! Thanks

like image 258
Keith Avatar asked Jun 23 '13 01:06

Keith


People also ask

Can BeautifulSoup handle broken HTML?

BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2.


1 Answers

HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the input HTML structure. But sometimes (like in this case) that model gets in the way more than helps. Pyparsing includes some HTML parsing features that are more robust than just using raw regexes, but otherwise work in similar fashion, letting you define snippets of HTML of interest, and just ignoring the rest. Here is a parser that reads through your posted HTML source:

from pyparsing import makeHTMLTags,withAttribute,Suppress,Regex,Group

""" looking for this recurring pattern:
          <td valign="top" bgcolor="#FFFFCC">00-03</td>
          <td valign="top">.50</td>
          <td valign="top">.50</td>

    and want a dict with keys 0, 1, 2, and 3 all with values (.50,.50)
"""

td,tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td,tdend,keytd = map(Suppress,(td,tdend,keytd))

realnum = Regex(r'1?\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t:int(t[0]))
DASH = Suppress('-')

# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend + 
                    Group(2*(td + realnum + tdend))("vals"))

This parser not only picks out the matching triples, it also extracts the start-end integers and the pairs of real numbers (and also already converts from string to integers or floats at parse time).

Looking at the table, I'm guessing you actually want a lookup that will take a key like 700, and return the pair of values (0.99, 0.01), since 700 falls in the range of 620-735. This bit of code searches the source HTML text, iterates over the matched entries and inserts key-value pairs into the dict lookup:

# search the input HTML for matches to the entryExpr expression, and build up lookup dict
lookup = {}
for entry in entryExpr.searchString(sourcehtml):
    for i in range(entry.start, entry.end+1):
        lookup[i] = tuple(entry.vals)

And now to try out some lookups:

# print out some test values
for test in (0,20,100,700):
    print (test, lookup[test])

prints:

0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)
like image 175
PaulMcG Avatar answered Sep 28 '22 02:09

PaulMcG