Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I get the first and third td from a table with BeautifulSoup?

I am currently using Python and BeautifulSoup to scrape some website data. I'm trying to pull cells from a table which is formatted like so:

<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>

The problem with the above HTML is that BeautifulSoup reads it as one tag. I need to pull the values from the first <td> and the third <td>, which would be 1 and 20, respectively.

Unfortunately, I have no idea how to go about this. How can I get BeautifulSoup to read the 1st and 3rd <td> tags of each row of the table?

Update:

I figured out the problem. I was using html.parser instead of the default for BeautifulSoup. Once I switched to the default the problems went away. Also I used the method listed in the answer.

I also found out that the different parsers are very temperamental with broken code. For instance, the default parser refused to read past row 192, but html5lib got the job done.So try using lxml, html, and also html5lib if you are having problems parsing the entire table.

like image 632
Alex Ketay Avatar asked Aug 14 '13 08:08

Alex Ketay


People also ask

How do I find a specific element with Beautifulsoup?

BeautifulSoup has a limited support for CSS selectors, but covers most commonly used ones. Use select() method to find multiple elements and select_one() to find a single element.


1 Answers

That's a nasty piece of HTML you've got there. If we ignore the semantics of table rows and table cells for a moment and treat it as pure XML, its structure looks like this:

<tr>
  <td>1
    <td>
      <td>20
        <td>5%</td>
      </td>
    </td>
  </td>
</tr>

BeautifulSoup, however, knows about the semantics of HTML tables, and instead parses it like this:

<tr>
  <td>1        <!-- an IMPLICITLY (no closing tag) closed td element -->
  <td>         <!-- as above -->
  <td>20       <!-- as above -->
  <td>5%</td>  <!-- an EXPLICITLY closed td element -->
  </td>        <!-- an error; ignore this -->
  </td>        <!-- as above -->
  </td>        <!-- as above -->
</tr>

... so that, as you say, 1 and 20 are in the first and third td elements (not tags) respectively.

You can actually get at the contents of these td elements like this:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>")
>>> tr = soup.find("tr")
>>> tr
<tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
>>> td_list = tr.find_all("td")
>>> td_list
[<td>1</td>, <td></td>, <td>20</td>, <td>5%</td>]
>>> td_list[0]  # Python starts counting list items from 0, not 1
<td>1</td>
>>> td_list[0].text
'1'
>>> td_list[2].text
'20'
>>> td_list[3].text
'5%'
like image 73
Zero Piraeus Avatar answered Oct 06 '22 05:10

Zero Piraeus