Parse HTML Table with Python BeautifulSoup

Tags: python, html, beautifulsoup

I am trying to use BeautifulSoup to parse an HTML table, which I uploaded to http://pastie.org/8070879, in order to get its three columns (0 to 735, 0.50 to 1.0, and 0.5 to 0.0) as lists. Ultimately, I want the integers 0-735 to be keys and the decimal numbers to be values.

After reading many of the other posts on SO, I came up with the following, which does not come close to creating the lists I want. All it does is display the text in the table, as seen here: http://i1285.photobucket.com/albums/a592/TheNexulo/output_zps20c5afb8.png

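from bs4 import BeautifulSoup

# naming the parser explicitly avoids bs4's "no parser specified" warning
with open("fide.html") as f:
    soup = BeautifulSoup(f, "html.parser")
table = soup.find('table')

rows = table.find_all('tr')

for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        # first text node inside the cell
        text = ''.join(td.find(text=True))
        print(text + "|", end=" ")
    print()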

I'm new to Python and BeautifulSoup, so please be gentle with me! Thanks

258

asked Jun 23 '13 01:06

Keith

1 Answer

HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the structure of the input HTML. But sometimes (as in this case) that model gets in the way more than it helps. Pyparsing includes some HTML parsing features that are more robust than raw regexes but otherwise work in a similar fashion: you define the snippets of HTML that interest you and ignore the rest. Here is a parser that reads through your posted HTML source:

from pyparsing import makeHTMLTags, withAttribute, Suppress, Regex, Group

""" looking for this recurring pattern:
          <td valign="top" bgcolor="#FFFFCC">00-03</td>
          <td valign="top">.50</td>
          <td valign="top">.50</td>

    and want a dict with keys 0, 1, 2, and 3 all with values (.50,.50)
"""

# match <td> open/close tags; keytd additionally requires the highlighted background
td, tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
# suppress the tags themselves so only the cell contents appear in the results
td, tdend, keytd = map(Suppress, (td, tdend, keytd))

# numbers are converted to float/int at parse time
realnum = Regex(r'1?\.\d+').setParseAction(lambda t: float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t: int(t[0]))
DASH = Suppress('-')

# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
             Group(2*(td + realnum + tdend))("vals"))

This parser not only picks out the matching triples, it also extracts the start and end integers and the pair of real numbers, converting them from strings to ints or floats at parse time.
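To see what the expression returns without the full page, here is a minimal, self-contained sketch that runs the same parser over a small inline snippet. The `sample_html` string and the `entries` list are illustrative names that are not part of the original answer, and the second row is made up to follow the pattern shown in the docstring.

```python
from pyparsing import makeHTMLTags, withAttribute, Suppress, Regex, Group

# Illustrative input: the first row comes from the docstring pattern,
# the second row is an assumed example in the same shape.
sample_html = """
<tr>
  <td valign="top" bgcolor="#FFFFCC">00-03</td>
  <td valign="top">.50</td>
  <td valign="top">.50</td>
</tr>
<tr>
  <td valign="top" bgcolor="#FFFFCC">04-10</td>
  <td valign="top">.51</td>
  <td valign="top">.49</td>
</tr>
"""

# Same parser definition as in the answer
td, tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td, tdend, keytd = map(Suppress, (td, tdend, keytd))

realnum = Regex(r'1?\.\d+').setParseAction(lambda t: float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t: int(t[0]))
DASH = Suppress('-')

entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
             Group(2 * (td + realnum + tdend))("vals"))

# Collect (start, end, values) for every match found in the snippet
entries = [
    (entry["start"], entry["end"], tuple(entry["vals"]))
    for entry in entryExpr.searchString(sample_html)
]

for item in entries:
    print(item)
```

Each match carries the already-converted numbers, so no further string handling is needed before building a dict from them.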

Looking at the table, I'm guessing you actually want a lookup that will take a key like 700, and return the pair of values (0.99, 0.01), since 700 falls in the range of 620-735. This bit of code searches the source HTML text, iterates over the matched entries and inserts key-value pairs into the dict lookup:

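# read the HTML source (assuming it was saved as fide.html, the file name used in the question)
with open("fide.html") as f:
    sourcehtml = f.read()

# search the input HTML for matches to the entryExpr expression, and build up lookup dict
lookup = {}
for entry in entryExpr.searchString(sourcehtml):
    for i in range(entry.start, entry.end+1):
        lookup[i] = tuple(entry.vals)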
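The loop above stores one dict entry per integer, which is perfectly reasonable for a table that tops out at 735. As a side note, if the ranges were much wider, one possible alternative (a sketch, not part of the original answer) would be to keep the ranges themselves and find the right one with the standard `bisect` module. The names `ranges`, `starts`, and `find_values` are hypothetical, and the two rows are the ones quoted in this answer.

```python
from bisect import bisect_right

# (start, end, values) rows; both are taken from the examples in this answer
ranges = [
    (0, 3, (0.5, 0.5)),
    (620, 735, (0.99, 0.01)),
]
ranges.sort()
starts = [start for start, _end, _vals in ranges]


def find_values(key):
    """Return the values of the range containing key, or None if no range does."""
    # Index of the last range whose start is <= key
    pos = bisect_right(starts, key) - 1
    if pos < 0:
        return None
    start, end, vals = ranges[pos]
    return vals if start <= key <= end else None


print(find_values(700))
print(find_values(100))  # not covered by the two sample rows
```

For this table, the plain dict built above remains the simpler choice.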

And now to try out some lookups:

# print out some test values
for test in (0, 20, 100, 700):
    print(test, lookup[test])

prints:

0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)
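For comparison, since the question asked about BeautifulSoup, here is a rough sketch of how the same lookup could be built with it. It assumes that every key cell carries `bgcolor="#FFFFCC"` and is immediately followed by its two value cells, as in the pattern above. The `sample_html` snippet and the `bs_lookup` name are illustrative only, and the real page may need extra handling.

```python
from bs4 import BeautifulSoup

# Illustrative input built from the rows quoted in this answer
sample_html = """
<table>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">00-03</td>
    <td valign="top">.50</td>
    <td valign="top">.50</td>
  </tr>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">620-735</td>
    <td valign="top">.99</td>
    <td valign="top">.01</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

bs_lookup = {}
# Key cells are identified by their background colour (assumption from the pattern above)
for key_cell in soup.find_all("td", bgcolor="#FFFFCC"):
    # Assumption: the two value cells are the next two <td> siblings
    value_cells = key_cell.find_next_siblings("td", limit=2)
    start, end = (int(part) for part in key_cell.get_text(strip=True).split("-"))
    vals = tuple(float(cell.get_text(strip=True)) for cell in value_cells)
    # Expand the range so every integer in it maps to the same pair
    for i in range(start, end + 1):
        bs_lookup[i] = vals

print(bs_lookup[700])
```

This version walks the object model instead of matching text, so it depends on the cells being well-formed siblings within each row.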

175

answered Sep 28 '22 02:09

PaulMcG

