Parse HTML table to Python list?

Tags:

html

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.

If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date" and that table had 5 entries, I would like to parse through that table to get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date".

Thanks for the help!

680

asked Jun 12 '11 22:06

Andrew

2 Answers

You should use some HTML parsing library like lxml:

from lxml import etree

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

# etree.HTML() wraps the fragment in <html><body>, hence the "body/table" path
table = etree.HTML(s).find("body/table")
rows = iter(table)
# The first row holds the <th> header cells
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints (on Python 3.7+, where dictionaries keep insertion order)

{'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
{'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
{'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}
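The answer prints each row, while the question asks for a list of dictionaries. As a rough sketch of collecting that list without any third-party dependency, the standard library's html.parser can be used instead of lxml. The names TableParser and table_to_dicts are made up for this illustration, and the code assumes a single, simple table whose first row holds the headers (no nested tables, no colspan/rowspan).

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_dicts(html):
    """Return one dict per data row, keyed by the first row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    headers, *body = parser.rows
    return [dict(zip(headers, row)) for row in body]


html_table = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>"""

events = table_to_dicts(html_table)
print(events)
```

For messy real-world HTML, a dedicated parser such as lxml is likely to be more forgiving than this sketch.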

114

answered Sep 23 '22 03:09

Sven Marnach

Hands down, the easiest way to parse an HTML table is pandas.read_html(), which accepts both URLs and HTML.

import pandas as pd

url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url)  # Returns list of all tables on page
sp500_table = tables[0]  # Select table of interest

The only downside is that read_html() doesn't preserve hyperlinks by default; pandas 1.5 and later add an extract_links parameter for that.
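To get from the DataFrame to the list of dictionaries the question asks for, DataFrame.to_dict(orient="records") should be enough. A minimal sketch, assuming pandas and a parser backend such as lxml are installed; the sample table and the variable names are only illustrative. The HTML is wrapped in StringIO because recent pandas versions deprecate passing a literal HTML string directly.

```python
from io import StringIO

import pandas as pd

# Illustrative sample table; the first row of <th> cells becomes the column names
html = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>"""

# read_html() returns a list with one DataFrame per <table> it finds
df = pd.read_html(StringIO(html))[0]

# orient="records" yields one dict per row, keyed by column name
records = df.to_dict(orient="records")
print(records)
```

Note that pandas infers column types, so numeric-looking cells come back as numbers rather than strings.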

answered Sep 22 '22 03:09

zelusp
