
BeautifulSoup in Python - getting the n-th tag of a type

I have some HTML that contains many <table> elements.

I'm trying to get the information in the second table. Is there a way to do this without using soup.findAll('table') ?

When I do use soup.findAll('table'), I get an error:

ValueError: too many values to unpack

Is there a way to get the n-th tag in some code or another way that does not require going through all the tables? Or should I see if I can add titles to the tables? (like <table title="things">)

There are also headers (<h4>title</h4>) above each table, if that helps.

Thanks.

EDIT

Here's what I was thinking when I asked the question:

I was unpacking the objects into two values, when there were many more. I thought this would just give me the first two things from the list, but of course, it kept giving me the error mentioned above. I was unaware the return value was a list and thought it was a special object or something and I was basing my code off of my friends'.

I was thinking this error meant there were too many tables on the page and that it couldn't handle all of them, so I was asking for a way to do it without the method I was using. I probably should have stopped assuming things.

Now I know it returns a list and I can use this in a for loop or get a value from it with soup.findAll('table')[someNumber]. I learned what unpacking was and how to use it, as well. Thanks everyone who helped.
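
For anyone landing here with the same confusion, a minimal sketch of both uses follows. The markup is made up for illustration, and find_all is the bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original page.
html = """
<table><tr><td>first</td></tr></table>
<table><tr><td>second</td></tr></table>
<table><tr><td>third</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

# The result supports len(), indexing, and iteration like a list.
second_table = tables[1]
cell_texts = [table.get_text(strip=True) for table in tables]

print(len(tables))
print(second_table.get_text(strip=True))
print(cell_texts)
```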

Hopefully that clears things up. Now that I know what I'm doing, my question makes less sense than it did when I asked it, so I thought I'd put a note here on what I was thinking.

EDIT 2:

This question is now pretty old, but I still see that I was never really clear about what I was doing.

If it helps anyone, I was attempting to unpack the findAll(...) results without knowing how many of them there were.

useless_table, table_i_want, another_useless_table = soup.findAll("table")

Since the page didn't always contain the number of tables I had guessed, and every value in the returned list has to be unpacked, I was getting the ValueError:

ValueError: too many values to unpack

So, I was looking for a way to grab the second table (or the one at whichever index) from the returned list without running into errors about how many tables there were.
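
A rough sketch of the failure and two ways around it, assuming Python 3 and a made-up page with four empty tables. The star-unpacking form still fails if there are fewer than two tables, so it is not a complete fix on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical page with four tables (one more than the code expects).
html = "<table></table>" * 4
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

# Unpacking into exactly three names fails because there are four tables.
try:
    first, wanted, third = tables
except ValueError as exc:
    error_message = str(exc)

# Star-unpacking (Python 3) collects any leftover tables into a list.
first, wanted, *rest = tables

# Indexing avoids unpacking entirely; the length check guards short pages.
second = tables[1] if len(tables) > 1 else None

print(error_message)
print(len(rest))
```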

nasonfish asked Dec 30 '12 22:12



2 Answers

To get the second table from the call soup.findAll('table'), treat the result as a list and index it:

secondtable = soup.findAll('table')[1]
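
Indexing depends on the table order staying fixed. Since the question mentions an <h4> heading above each table, another option might be to find the heading by its text and take the first table that follows it. This is only a sketch: the heading texts and cell values are invented, and the string= argument assumes a reasonably recent bs4 (older releases spell it text=).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: heading texts and cell values are made up.
html = """
<h4>people</h4>
<table><tr><td>alice</td></tr></table>
<h4>things</h4>
<table><tr><td>lamp</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the heading by its exact text, then the first <table> after it.
heading = soup.find("h4", string="things")
table = heading.find_next("table") if heading is not None else None

# Collect the cell texts, or an empty list when nothing matched.
cells = []
if table is not None:
    cells = [td.get_text(strip=True) for td in table.find_all("td")]

print(cells)
```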
Martijn Pieters answered Sep 21 '22 05:09


Martijn Pieters's answer will indeed work. However, nested <table> tags once broke my code when I simply took the second table in the list without checking.

When you call find_all and take the n-th element, there is a risk of picking the wrong one. It is safer to locate the first element you want and make sure the n-th element is actually a sibling of that element rather than one of its children.

  1. You can use find_next_sibling() to make the lookup safer.
  2. You can find the parent first and then use find_all(recursive=False) to restrict the search to its direct children.
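
The first option could look roughly like this. The markup and the id attributes are hypothetical and only there to make the result easy to check.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first table contains a nested table.
html = """
<body>
<table id="t1"><tr><td><table id="nested"></table></td></tr></table>
<table id="t2"></table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find("table")

# find_next_sibling only considers later elements with the same parent,
# so the table nested inside the first one is skipped.
second = first.find_next_sibling("table")

# By contrast, find_next walks the document in order and hits the nested table.
next_in_document = first.find_next("table")

print(second["id"])
print(next_in_document["id"])
```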

Just in case you need it, my code is listed below (using recursive=False).

from bs4 import BeautifulSoup

text = """
<html>
    <head>
    </head>
    <body>
        <table>
            <p>Table1</p>
            <table>
                <p>Extra Table</p>
            </table>
        </table>
        <table>
            <p>Table2</p>
        </table>
    </body>
</html>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(text, "html.parser")

# Default (recursive) search: also finds the table nested inside the first one.
tables = soup.find('body').find_all('table')
print(len(tables))
print(tables[1].text.strip())
# 3
# Extra Table   <- not the table you want, and there is no warning

# recursive=False: only tables that are direct children of <body>.
tables = soup.find('body').find_all('table', recursive=False)
print(len(tables))
print(tables[1].text.strip())
# 2
# Table2   <- your desired output
B.Mr.W. answered Sep 20 '22 05:09