Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert an HTML table into a Python dictionary

I have the following HTML excerpt in a format of a Python list that I'd like to turn into a dictionary. It is a timetable for everyday of the week.

[u'
<table class="hours table">\n
    <tbody>\n
        <tr>\n
            <th scope="row">Mon</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Tue</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Wed</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n <span class="nowrap open">Open now</span>\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Thu</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Fri</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sat</th>\n
            <td>\n <span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sun</th>\n
            <td>\n Closed\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n </tbody>\n </table>']

The wishful output is:

{
'Mon': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Tue': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Wed': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Thu': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Fri': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Sat': '5:00pm - 10:00pm', 
'Sun': 'Closed'
}

How would you achieve this in Python 3.x? I would not mind if the 'Sat' and 'Sun' keys have values in a list format if that'd help at all. Thank you for your thoughts in advance.

like image 352
mysl Avatar asked Aug 02 '18 15:08

mysl


2 Answers

Here's a solution which first reads into Pandas DataFrame, and then converts to dictionary as in your desired output:

import pandas as pd

dfs = pd.read_html(html_string)
df = dfs[0]  # pd.read_html reads in all tables and returns a list of DataFrames

Giving:

     0                                      1         2
0  Mon  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
1  Tue  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
2  Wed  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm  Open now
3  Thu  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
4  Fri  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
5  Sat                     5:00 pm - 10:00 pm       NaN
6  Sun                                 Closed       NaN

Then use groupby and a dictionary comprehension:

summary = {k: v.iloc[0, 1].split('  ') for k, v in df.groupby(0)}

Giving:

{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Sat': ['5:00 pm - 10:00 pm'],
 'Sun': ['Closed'],
 'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
 'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}

You may need to edit slightly if splitting on exactly two spaces won't always work for your opening times data format.

like image 162
sjw Avatar answered Oct 21 '22 22:10

sjw


from bs4 import BeautifulSoup
from collections import OrderedDict
from pprint import pprint

soup = BeautifulSoup(data, 'lxml')

d = OrderedDict()
for th, td in zip(soup.select('th'), soup.select('td')[::2]):
    d[th.text.strip()] = td.text.strip().splitlines()

pprint(d)

Prints:

OrderedDict([('Mon', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Tue', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Wed', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Thu', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Fri', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
             ('Sat', ['5:00 pm - 10:00 pm']),
             ('Sun', ['Closed'])])
like image 7
Andrej Kesely Avatar answered Oct 21 '22 21:10

Andrej Kesely