
How to loop through an HTML table dataset in Python

I'm a first-time poster here trying to pick up some Python skills; please be kind to me :-)

While I'm not a complete stranger to programming concepts (I've been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this mostly has to do with the fact that I lack most - if not all - basic understanding of common "design patterns" (?) and such.

That said, here is the problem. Part of my current project involves writing a simple scraper using Beautiful Soup. The data to be processed has a structure similar to the one laid out below.

<table>
    <tr>
        <td class="date">2011-01-01</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr>
        <td class="date">2011-01-02</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>

The main issue is that I simply can't get my head around how to 1) keep track of the current date (tr->td class="date") while 2) looping over the items in the subsequent rows (tr class="item"->td class="headline" and tr class="item"->td class="link") and 3) storing the processed data in an array.

Additionally, all data will be inserted into a database where each entry must contain the following information:

  • date
  • headline
  • link

Note that the database CRUD operations are not part of the problem; I only mention them to better illustrate what I'm trying to accomplish here :-)

Now, there are many different ways to skin a cat. So while a solution to the issue at hand is indeed very welcome, I'd be extremely grateful if someone would care to elaborate on the actual logic and strategy you would make use of in order to "attack" this kind of problem :-)

Last but not least, sorry for such a noobish question.

asked Jan 07 '11 02:01 by Mattias




2 Answers

The basic problem is that this table is marked up for looks, not for semantic structure. Properly done, each date and its related items should share a parent. Unfortunately, they don't, so we'll have to make do.

The basic strategy is to iterate through each row in the table:

  • If the first table cell has class 'date', we get the date value and update last_seen_date.
  • Otherwise, we extract a headline and a link, then save (last_seen_date, headline, link) to the database.

from bs4 import BeautifulSoup  # Beautiful Soup 4; version 3 used "import BeautifulSoup"

fname = r'c:\mydir\beautifulSoup.html'
with open(fname, 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

items = []              # collected (date, headline, link) tuples
last_seen_date = None   # date from the most recent date row
for el in soup.find_all('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:     # not a date - get headline and link
        headline = el.find('td', {'class': 'headline'}).text
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:                   # get new date
        last_seen_date = daterow.text
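
As for the general logic: the loop above boils down to carrying the most recent date forward and attaching it to every item row that follows. Here is a minimal sketch of just that step, independent of any parser. The function name attach_dates and the dict-based row format are my own assumptions for illustration, not part of Beautiful Soup.

```python
def attach_dates(rows):
    """Yield one record per item row, tagged with the last date seen.

    `rows` is assumed (hypothetically) to be an iterable of dicts that are
    either {'date': ...} or {'headline': ..., 'link': ...}.
    """
    last_seen_date = None
    for row in rows:
        if 'date' in row:
            # Date row: remember it, emit nothing.
            last_seen_date = row['date']
        else:
            # Item row: combine it with the remembered date.
            yield {'date': last_seen_date, **row}


# Rows mirroring the start of the table in the question.
rows = [
    {'date': '2011-01-01'},
    {'headline': 'Headline', 'link': '#'},
    {'headline': 'Headline', 'link': '#'},
    {'date': '2011-01-02'},
    {'headline': 'Headline', 'link': '#'},
]
records = list(attach_dates(rows))
```

Each resulting dict has the three fields (date, headline, link) the question wants per database entry. Items that appear before any date row get None as their date.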
answered Nov 13 '22 14:11 by Hugh Bothwell


You can use ElementTree, which is included in the Python standard library.

http://docs.python.org/library/xml.etree.elementtree.html

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the OP
root = tree.getroot()     # Returns the root "table" element
print(root.tag)           # "table"
for eachTableRow in root:
    # Iterating over root yields its child <tr> elements
    # (getchildren() was removed in Python 3.9),
    # so we loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass

That should be enough to get you on your way to parsing this.
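
For reference, here is one way the two branches might be filled in. It is a sketch that assumes the markup is well-formed XML, as in the question. The inline sample is shortened to one item per date, with made-up headlines and hrefs so the rows are distinguishable, and extract_records is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question's table; the headline text and
# href values are invented so that each row can be told apart.
XHTML = """
<table>
    <tr><td class="date">2011-01-01</td></tr>
    <tr class="item">
        <td class="headline">Headline A</td>
        <td class="link"><a href="#a">Link</a></td>
    </tr>
    <tr><td class="date">2011-01-02</td></tr>
    <tr class="item">
        <td class="headline">Headline B</td>
        <td class="link"><a href="#b">Link</a></td>
    </tr>
</table>
"""


def extract_records(xhtml):
    """Return a list of (date, headline, href) tuples from the table."""
    root = ET.fromstring(xhtml)
    records = []
    current_date = None
    for row in root.findall('tr'):
        if row.get('class') == 'item':
            # Item row: pull the headline text and the link target.
            headline = row.find("td[@class='headline']").text
            href = row.find("td[@class='link']/a").get('href')
            records.append((current_date, headline, href))
        else:
            # Otherwise look for a date cell and remember its value.
            date_cell = row.find("td[@class='date']")
            if date_cell is not None:
                current_date = date_cell.text
    return records


records = extract_records(XHTML)
```

Keep in mind that ElementTree is an XML parser, so real-world HTML that is not well-formed will raise a ParseError here; in that case a lenient HTML parser such as Beautiful Soup is likely the safer choice.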

answered Nov 13 '22 12:11 by user407896