
Parse HTML table to Python list?

Tags: python, html

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.

If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date", and that table had 5 entries, I would like to parse it and get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date".

Thanks for the help!

Andrew asked Jun 12 '11 22:06





2 Answers

You should use some HTML parsing library like lxml:

    from lxml import etree

    s = """<table>
      <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
      <tr><td>a</td><td>b</td><td>c</td></tr>
      <tr><td>d</td><td>e</td><td>f</td></tr>
      <tr><td>g</td><td>h</td><td>i</td></tr>
    </table>
    """

    # etree.HTML() wraps the fragment in <html><body>, hence the "body/table" path
    table = etree.HTML(s).find("body/table")
    rows = iter(table)
    headers = [col.text for col in next(rows)]  # first row holds the <th> cells
    for row in rows:
        values = [col.text for col in row]
        print(dict(zip(headers, values)))  # Python 3 print()

prints

    {'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
    {'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
    {'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}
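
Since the question asks for a list of dictionaries rather than printed rows, here is a minimal sketch that collects the rows into a list. It swaps lxml for the standard library's html.parser so it runs without extra installs. The names TableParser and table_to_dicts are hypothetical helpers, not part of any library, and the sketch assumes a single, non-nested table whose first row holds the headers and whose cells are explicitly closed.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Hypothetical helper: collects every <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell texts
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <th> or <td>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_dicts(html):
    """Return one dict per data row, keyed by the header row (assumed first)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    headers, *body = parser.rows
    return [dict(zip(headers, row)) for row in body]


sample = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>"""

events = table_to_dicts(sample)
print(events)
```

For real-world pages with nested tables, missing closing tags, or colspan/rowspan cells, a full parser such as lxml is likely the safer choice.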
Sven Marnach answered Sep 23 '22 03:09



Hands down the easiest way to parse an HTML table is to use pandas.read_html(), which accepts both URLs and HTML.

    import pandas as pd

    url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    tables = pd.read_html(url)  # Returns list of all tables on page
    sp500_table = tables[0]  # Select table of interest

The only downside is that read_html() doesn't preserve hyperlinks by default (pandas 1.5 and later add an extract_links argument for that).
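
To get from the DataFrame to the list of dictionaries the question asks for, DataFrame.to_dict("records") should be enough. The sketch below reuses the small sample table from the lxml answer instead of the Wikipedia URL so it runs offline. It assumes pandas plus one of its HTML parser backends (lxml, or bs4 with html5lib) is installed, and it wraps the markup in StringIO because newer pandas versions warn when given a literal HTML string.

```python
from io import StringIO

import pandas as pd

html = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> found
df = pd.read_html(StringIO(html))[0]

# "records" orientation yields one dict per row, keyed by column name
records = df.to_dict("records")
print(records)
```

Note that read_html may convert cell values (numbers, or strings such as "NA") to numeric or missing values, so the dictionaries are not guaranteed to hold the raw cell text.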

zelusp answered Sep 22 '22 03:09
