I am currently using Python and BeautifulSoup to scrape some website data. I'm trying to pull cells from a table which is formatted like so:
<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>
The problem with the above HTML is that BeautifulSoup reads it as one tag. I need to pull the values from the first <td> and the third <td>, which would be 1 and 20, respectively.
Unfortunately, I have no idea how to go about this. How can I get BeautifulSoup to read the 1st and 3rd <td> tags of each row of the table?
Update:
I figured out the problem. I was using html.parser instead of the default for BeautifulSoup. Once I switched to the default, the problems went away. I also used the method listed in the answer.
I also found out that the different parsers are very temperamental with broken markup. For instance, the default parser refused to read past row 192, but html5lib got the job done. So try lxml, html.parser, and html5lib in turn if you are having problems parsing the entire table.
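One way to compare the parsers is to count how many rows each one finds in the same markup. This is only a rough sketch: the helper name count_rows_per_parser is made up for illustration, and it assumes that lxml and html5lib may or may not be installed, so any parser that is missing is skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows_per_parser(html, parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser name: number of <tr> elements found}.

    Hypothetical helper for illustration. Parsers that are not
    installed raise FeatureNotFound and are simply skipped.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser is not available in this environment
        counts[name] = len(soup.find_all("tr"))
    return counts
```

If one parser reports noticeably fewer rows than the others, it has probably given up on the broken markup part-way through, and a different parser is worth trying.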
BeautifulSoup has limited support for CSS selectors, but it covers the most commonly used ones. Use the select() method to find multiple elements and select_one() to find a single element.
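As a rough illustration of select() and select_one(), here is a sketch on a small, well-formed table. The second row and its values are invented sample data, and the nth-of-type selector is assumed to be available, which it is in recent bs4 releases that use the soupsieve package.

```python
from bs4 import BeautifulSoup

# Illustrative, well-formed markup; the second row is made-up sample data.
html = """
<table>
  <tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
  <tr><td>2</td><td></td><td>35</td><td>9%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match only (or None if nothing matches).
first_row = soup.select_one("tr")

# select() returns a list of every match.
firsts = [td.get_text() for td in soup.select("tr > td:nth-of-type(1)")]
thirds = [td.get_text() for td in soup.select("tr > td:nth-of-type(3)")]
```

Note that this only works as expected once the parser has produced sibling td elements; with badly nested markup the selector results depend on which parser built the tree.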
That's a nasty piece of HTML you've got there. If we ignore the semantics of table rows and table cells for a moment and treat it as pure XML, its structure looks like this:
<tr>
  <td>1
    <td>
      <td>20
        <td>5%</td>
      </td>
    </td>
  </td>
</tr>
BeautifulSoup, however, knows about the semantics of HTML tables, and instead parses it like this:
<tr>
  <td>1        <!-- an IMPLICITLY (no closing tag) closed td element -->
  <td>         <!-- as above -->
  <td>20       <!-- as above -->
  <td>5%</td>  <!-- an EXPLICITLY closed td element -->
  </td>        <!-- an error; ignore this -->
  </td>        <!-- as above -->
  </td>        <!-- as above -->
</tr>
... so that, as you say, 1 and 20 are in the first and third td elements (not tags) respectively.
You can actually get at the contents of these td elements like this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>", "lxml")  # name the parser explicitly; html.parser nests these cells instead
>>> tr = soup.find("tr")
>>> tr
<tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
>>> td_list = tr.find_all("td")
>>> td_list
[<td>1</td>, <td></td>, <td>20</td>, <td>5%</td>]
>>> td_list[0] # Python starts counting list items from 0, not 1
<td>1</td>
>>> td_list[0].text
'1'
>>> td_list[2].text
'20'
>>> td_list[3].text
'5%'
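To apply the same indexing to every row of a table, something along these lines should work. It is a sketch, not a definitive implementation: the function name first_and_third_cells is hypothetical, and the default parser is assumed to be lxml, which must be installed separately (pass "html5lib" or "html.parser" to try another one).

```python
from bs4 import BeautifulSoup


def first_and_third_cells(html, parser="lxml"):
    """Return a list of (first, third) cell texts, one tuple per table row.

    Hypothetical helper for illustration. Rows with fewer than three
    td elements (header rows, for example) are skipped.
    """
    soup = BeautifulSoup(html, parser)
    results = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # not enough cells to have a third one
        first = cells[0].get_text(strip=True)
        third = cells[2].get_text(strip=True)
        results.append((first, third))
    return results
```

Calling first_and_third_cells(page_html) on the downloaded page source then gives one tuple per data row, provided the chosen parser closes the td elements implicitly as described above.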