How can I get the first and third td from a table with BeautifulSoup?

I am currently using Python and BeautifulSoup to scrape some website data. I'm trying to pull cells from a table which is formatted like so:

<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>

The problem with the above HTML is that BeautifulSoup reads it as one tag. I need to pull the values from the first <td> and the third <td>, which would be 1 and 20, respectively.

Unfortunately, I have no idea how to go about this. How can I get BeautifulSoup to read the 1st and 3rd <td> tags of each row of the table?

Update:

I figured out the problem. I was explicitly passing html.parser instead of letting BeautifulSoup pick its default parser. Once I switched to the default, the problem went away, and the method listed in the answer worked.

I also found out that the different parsers are very temperamental with broken markup. For instance, the default parser refused to read past row 192, but html5lib got the job done. So if you are having problems parsing the entire table, try each of lxml, html.parser, and html5lib.
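
A quick way to see how much the parsers disagree is to feed the same broken row to each one and look at the text of the first td. The following is only a sketch: the helper name first_cell_text_by_parser is made up for illustration, the row is wrapped in a table element (an assumption, so that stricter parsers do not discard a bare tr), and any parser that is not installed is simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The broken row from the question, wrapped in <table> (an assumption made
# here so that every parser sees the row in a table context).
HTML = "<table><tr><td>1<td><td>20<td>5%</td></td></td></td></tr></table>"


def first_cell_text_by_parser(html, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: text of the first td} for each installed parser.

    Hypothetical helper, not part of BeautifulSoup.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The library backing this parser is not installed; skip it.
            continue
        first_td = soup.find("td")
        results[name] = first_td.get_text() if first_td is not None else None
    return results


results = first_cell_text_by_parser(HTML)
for name, text in results.items():
    print(f"{name}: {text!r}")
```

A parser that closes each td implicitly should report just the first cell's value, while one that keeps the td elements nested reports the text of all the cells run together, which matches the "one tag" symptom described above.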

asked Aug 14 '13 08:08

Alex Ketay

1 Answer

That's a nasty piece of HTML you've got there. If we ignore the semantics of table rows and table cells for a moment and treat it as pure XML, its structure looks like this:

<tr>
  <td>1
    <td>
      <td>20
        <td>5%</td>
      </td>
    </td>
  </td>
</tr>

With a lenient parser such as lxml, however, BeautifulSoup applies the semantics of HTML tables and instead parses it like this:

<tr>
  <td>1        <!-- an IMPLICITLY (no closing tag) closed td element -->
  <td>         <!-- as above -->
  <td>20       <!-- as above -->
  <td>5%</td>  <!-- an EXPLICITLY closed td element -->
  </td>        <!-- an error; ignore this -->
  </td>        <!-- as above -->
  </td>        <!-- as above -->
</tr>

... so that, as you say, 1 and 20 are in the first and third td elements (not tags) respectively.

You can actually get at the contents of these td elements like this:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>", "lxml")
>>> tr = soup.find("tr")
>>> tr
<tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
>>> td_list = tr.find_all("td")
>>> td_list
[<td>1</td>, <td></td>, <td>20</td>, <td>5%</td>]
>>> td_list[0]  # Python starts counting list items from 0, not 1
<td>1</td>
>>> td_list[0].text
'1'
>>> td_list[2].text
'20'
>>> td_list[3].text
'5%'
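
To apply the same indexing to every row, and to stay independent of whether the parser flattens or nests the td elements, one option is to read only the text that sits directly inside each td. This is a sketch rather than the method used above: own_text and first_and_third_cells are hypothetical helper names, the second row of the sample table is invented for illustration, and html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample input: the first row is from the question, the second row is
# made up purely for illustration.
HTML = """
<table>
<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>
<tr><td>2<td><td>35<td>9%</td></td></td></td></tr>
</table>
"""


def own_text(tag):
    """Join only the strings that are direct children of tag.

    Text belonging to nested child tags is deliberately left out, so the
    result is the same whether the parser nested the cells or not.
    """
    return "".join(
        child for child in tag.children if isinstance(child, NavigableString)
    ).strip()


def first_and_third_cells(html, parser="html.parser"):
    """Return a list of (first cell, third cell) text pairs, one per row."""
    soup = BeautifulSoup(html, parser)
    pairs = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")  # document order, nested or not
        if len(cells) >= 3:
            pairs.append((own_text(cells[0]), own_text(cells[2])))
    return pairs


pairs = first_and_third_cells(HTML)
print(pairs)
```

Because find_all returns the td elements in document order either way, the indices 0 and 2 still pick the first and third cells; only the way the text is read changes.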

answered Oct 06 '22 05:10

Zero Piraeus

Tags: python, html, html-table, html-parsing, beautifulsoup