I'm a first-time poster here trying to pick up some Python skills; please be kind to me :-)
While I'm not a complete stranger to programming concepts (I've messed around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this mostly has to do with the fact that I lack most, if not all, basic understanding of common "design patterns" (?) and such.
That said, here is the problem. Part of my current project involves writing a simple scraper using Beautiful Soup. The data to be processed has a structure similar to the one laid out below.
<table>
<tr>
<td class="date">2011-01-01</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
<tr>
<td class="date">2011-01-02</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
</table>
The main issue is that I simply can't get my head around how to 1) keep track of the current date (tr->td class="date") while 2) looping over the items in the subsequent rows (tr class="item"->td class="headline" and tr class="item"->td class="link") and 3) storing the processed data in an array.
Additionally, all data will be inserted into a database where each entry must contain the following information: the date, the headline, and the link.
Note that the database CRUD operations are not part of the problem; I only mentioned them to better illustrate what I'm trying to accomplish here :-)
Now, there are many different ways to skin a cat. So while a solution to the issue at hand is indeed very welcome, I'd be extremely grateful if someone would care to elaborate on the actual logic and strategy you would make use of in order to "attack" this kind of problem :-)
Last but not least, sorry for such a noobish question.
The basic problem is that this table is marked up for looks, not for semantic structure. Properly done, each date and its related items should share a parent. Unfortunately, they don't, so we'll have to make do.
The basic strategy is to iterate through each row in the table, remembering the most recent date seen and attaching it to every item row that follows.
# Written for Beautiful Soup 3 (Python 2). With the newer bs4 package the
# import is `from bs4 import BeautifulSoup` and findAll is spelled find_all.
import BeautifulSoup
fname = r'c:\mydir\beautifulSoup.html'
soup = BeautifulSoup.BeautifulSoup(open(fname, 'r'))
items = []
last_seen_date = None
for el in soup.findAll('tr'):
    daterow = el.find('td', {'class':'date'})
    if daterow is None:     # not a date - get headline and link
        headline = el.find('td', {'class':'headline'}).text
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:                   # get new date
        last_seen_date = daterow.text
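The loop above leaves `items` as a flat list of `(date, headline, link)` tuples. For the database step mentioned in the question, it may help to give those fields names and, if needed, group them per date. The following is only a rough sketch: the sample tuples mimic what the loop would produce for the table in the question, and `Entry` and `group_by_date` are hypothetical names that do not appear in the original answer.

```python
from collections import defaultdict, namedtuple

# Hypothetical record type: one database entry per scraped item row.
Entry = namedtuple("Entry", ["date", "headline", "link"])

# Sample data shaped like the `items` list built by the loop above.
items = [
    ("2011-01-01", "Headline", "#"),
    ("2011-01-01", "Headline", "#"),
    ("2011-01-02", "Headline", "#"),
    ("2011-01-02", "Headline", "#"),
]

# Turn each plain tuple into a record with named fields.
entries = [Entry(*item) for item in items]


def group_by_date(entries):
    """Return a dict mapping each date to the list of entries seen under it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.date].append(entry)
    return dict(grouped)


by_date = group_by_date(entries)
```

Named fields such as `entry.date` tend to be easier to follow than positional indexes like `item[0]` once the data is handed to whatever code performs the insert.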
You can use ElementTree, which is included in the Python standard library.
http://docs.python.org/library/xml.etree.elementtree.html
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the OP
root = tree.getroot()  # Returns the top-level "table" element
print(root.tag)  # "table"
for eachTableRow in root:
    # Iterating over root yields each of its <tr> child elements
    # (getchildren() was removed in Python 3.9),
    # so we loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass
That should be enough to get you on your way to parsing this.
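For reference, here is one way the two `pass` branches might be filled in. This is a sketch rather than the answerer's own code: it parses an inline string instead of `page.xhtml`, the sample markup is a shortened version of the table in the question with made-up headlines and links, and `parse_rows` is a hypothetical helper name. It also assumes the input is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample based on the table in the question.
# The headline texts and href values are invented for illustration.
SAMPLE = """<table>
<tr><td class="date">2011-01-01</td></tr>
<tr class="item">
<td class="headline">First headline</td>
<td class="link"><a href="/first">Link</a></td>
</tr>
<tr class="item">
<td class="headline">Second headline</td>
<td class="link"><a href="/second">Link</a></td>
</tr>
<tr><td class="date">2011-01-02</td></tr>
<tr class="item">
<td class="headline">Third headline</td>
<td class="link"><a href="/third">Link</a></td>
</tr>
</table>"""


def parse_rows(root):
    """Collect (date, headline, link) tuples from the <tr> children of root."""
    items = []
    last_seen_date = None
    for row in root.findall("tr"):
        if row.get("class") == "item":
            # Item row: pair it with the most recent date seen so far.
            headline = row.find("td[@class='headline']").text
            link = row.find("td[@class='link']/a").get("href")
            items.append((last_seen_date, headline, link))
        else:
            # Not an item row: update the date if the row carries one.
            date_cell = row.find("td[@class='date']")
            if date_cell is not None:
                last_seen_date = date_cell.text
    return items


root = ET.fromstring(SAMPLE)
items = parse_rows(root)
```

The row loop only remembers one thing between iterations, the last date it saw, which is what lets a flat list of rows be read as if each date owned the items beneath it.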