
How do I stop Beautiful Soup from skipping rows while parsing?

While using BeautifulSoup to parse an HTML table, I found that every other row starts with

<tr class="row_k">

instead of a tr tag without a class, so the rows without the class get skipped.

Sample HTML

<tr class="row_k"> 
<td><img src="some picture url" alt="Item A"></td> 
<td><a href="some url"> Item A</a></td> 
<td>14.8k</td> 
<td><span class="drop">-555</span></td> 
<td> 
<img src="some picture url" alt="stuff" title="stuff"> 
</td> 
<td> 
<img src="some picture url" alt="Max llll"> 
</td> 
</tr> 
<tr> 
<td><img src="some picture url" alt="Item B"></td> 
<td><a href="some url"> Item B</a></td> 
<td>64.9k</td> 
<td><span class="rise">+165</span></td> 
<td> 
<img src="some picture url" alt="stuff" title="stuff"> 
</td> 
<td> 
<img src="some picture url" alt="max llll"> 
</td> 
</tr> 
<tr class="row_k"> 
<td><img src="some picture url" alt="Item C"></td> 
<td><a href="some url"> Item C</a></td> 
<td>4,000</td> 
<td><span class="rise">+666</span></td> 
<td> 
<img src="some picture url" title="stuff"> 
</td> 
<td> 
<img src="some picture url" alt="Maximum lllle"> 

The text I wish to extract is 14.8k, 64.9k, and 4,000. This is my current code:

this1 = urllib2.urlopen('my url').read()
this_1 = BeautifulSoup(this1)
this_1a = StringIO.StringIO()
for row in this_1.findAll("tr", { "class" : "row_k" }):
  for col in row.findAll(re.compile('td')):
    this_1a.write(col.string if col.string else '')
Item_this1 = this_1a.getvalue()

I get the feeling that this code is poorly written. Is there a more flexible tool I could use, such as an XML parser, that someone could suggest?

I am still open to any answers that use BeautifulSoup.

Pevo asked Nov 18 '25 00:11


1 Answer

I am still learning a lot, but I suggest you try lxml. I will take a stab at this; I think it will mostly get you there, though there may be some niceties I am not certain about.

Assuming this1 is a string containing the page's HTML:

from lxml.html import fromstring

this1_tree = fromstring(this1)
# Pair every <td> with its position in the document.
# Note: cssselect() needs the separate cssselect package installed.
all_cells = list(enumerate(this1_tree.cssselect('td')))

The only thing I am not totally certain about is whether you should test the keys, the values, or the text_content() of each cell to find out whether it holds the string you are looking for, in the anchor reference or in its text. That is why I wanted a sample of your HTML. One of those should work:

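# Collect the index of every cell whose text contains 'Item';
# in the sample HTML the number sits in the cell right after it.
the_cell_before_numbers = []
for cell in all_cells:
    if 'Item' in cell[1].text_content():
        the_cell_before_numbers.append(cell[0])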

Now that you have the index of each cell that comes just before a number, you can get the value you need from the text content of the next cell:

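# the_cell_before_numbers is a list of indices, so look up the cell after each one
todays_prices = [all_cells[i + 1][1].text_content() for i in the_cell_before_numbers]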

I am sure there is a prettier way, but I think this will get you there.
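One possibly tidier variant, offered as a sketch rather than a drop-in: in the sample rows the number is always the third cell, so an XPath position test can pick that cell from every row, whether or not the row has a class. The names SAMPLE_HTML and third_cells are made up for illustration, and the sample is wrapped in a table tag and trimmed to four cells per row, which is an assumption about the real page.

```python
from lxml.html import fromstring

# Hypothetical stand-in for the question's sample HTML (assumption: wrapped
# in <table> and trimmed to the first four cells of each row).
SAMPLE_HTML = """
<table>
<tr class="row_k"><td><img src="some picture url" alt="Item A"></td>
<td><a href="some url"> Item A</a></td><td>14.8k</td>
<td><span class="drop">-555</span></td></tr>
<tr><td><img src="some picture url" alt="Item B"></td>
<td><a href="some url"> Item B</a></td><td>64.9k</td>
<td><span class="rise">+165</span></td></tr>
<tr class="row_k"><td><img src="some picture url" alt="Item C"></td>
<td><a href="some url"> Item C</a></td><td>4,000</td>
<td><span class="rise">+666</span></td></tr>
</table>
"""


def third_cells(html):
    """Return the stripped text of the third <td> in every <tr>."""
    tree = fromstring(html)
    # td[3] means the third <td> child of each row (XPath counts from 1).
    return [td.text_content().strip() for td in tree.xpath(".//tr/td[3]")]


print(third_cells(SAMPLE_HTML))
```

On the trimmed sample this should give the three values from the question. If the real table has rows with fewer cells, those rows simply contribute nothing.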

I tested using your html and I got what you were looking for.
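Since the question also asks for something that keeps BeautifulSoup, here is a minimal sketch along those lines. The rows are skipped because findAll("tr", { "class" : "row_k" }) only matches rows that carry that class; asking for every tr avoids the filter. This assumes the newer bs4 package rather than the BeautifulSoup 3 import used in the question, and SAMPLE_HTML and third_column are hypothetical names with a trimmed copy of the sample rows.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the question's sample HTML (assumption: wrapped
# in <table> and trimmed to the first four cells of each row).
SAMPLE_HTML = """
<table>
<tr class="row_k"><td><img src="some picture url" alt="Item A"></td>
<td><a href="some url"> Item A</a></td><td>14.8k</td>
<td><span class="drop">-555</span></td></tr>
<tr><td><img src="some picture url" alt="Item B"></td>
<td><a href="some url"> Item B</a></td><td>64.9k</td>
<td><span class="rise">+165</span></td></tr>
<tr class="row_k"><td><img src="some picture url" alt="Item C"></td>
<td><a href="some url"> Item C</a></td><td>4,000</td>
<td><span class="rise">+666</span></td></tr>
</table>
"""


def third_column(html):
    """Return the text of the third cell of every row, classed or not."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    # No class filter here, so plain <tr> rows are included too.
    for row in soup.find_all("tr"):
        cells = row.find_all("td", recursive=False)
        if len(cells) >= 3:  # skip header or malformed rows
            values.append(cells[2].get_text(strip=True))
    return values


print(third_column(SAMPLE_HTML))
```

Picking the cell by position instead of writing every cell's string into a buffer also avoids the StringIO step, though it does rely on the number always being in the third column.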

PyNEwbie answered Nov 20 '25 14:11


