Nested tags in BeautifulSoup - Python

I've looked at many examples on websites and on Stack Overflow, but I couldn't find a general solution to my problem. I'm dealing with a really messy website and I'd like to scrape some data from it. The markup looks like this:

...
<body>
...
    <table>
        <tbody>
            <tr>
            ...
            </tr>
            <tr>
                <td>
                ...
                </td>
                <td>
                    <table>
                        <tr>
                        ...
                        </tr>
                        <tr>
                            <td>
                                <a href="...">Some link</a>
                                <a href="...">Some link</a>
                                <a href="...">Some link</a>
                            </td>
                        </tr>
                    </table>
                </td>
            </tr>
        </tbody>
    </table>
</body>

The issue I'm having is that none of the elements have attributes that I can select on to narrow the scope. Each "..." may contain similar markup, such as more <a> and <table> elements.

I know that the path table tr table tr td a is unique to the links I need, but how would BeautifulSoup grab those? I'm not sure how to grab nested tags without writing a separate line of code for each level.

Any help?

Asked Apr 01 '13 18:04 by Eric Kim


1 Answer

You can pass a CSS selector to the select method of a BeautifulSoup (bs4) object:

soup.select('table tr table tr td a')

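For example, in an IPython session (Python 2, where urllib.urlopen is available), selecting the links inside the element with id "footer":

In [32]: bs4.BeautifulSoup(urllib.urlopen('http://google.com/?hl=en').read()).select('#footer a')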
Out[32]:
[<a href="/intl/en/ads/">Advertising Programs</a>,
 <a href="/services/">Business Solutions</a>,
 <a href="https://plus.google.com/116899029375914044550" rel="publisher">+Google</a>,
 <a href="/intl/en/about.html">About Google</a>,
 <a href="http://www.google.com/setprefdomain?prefdom=RU&amp;prev=http://www.google.ru/&amp;sig=0_3F2sRGWVktTCOFLA955Vr-AWlHo%3D">Google.ru</a>,
 <a href="/intl/en/policies/">Privacy &amp; Terms</a>]
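
To see the same selector at work on markup shaped like the question's, here is a minimal sketch. The HTML string, the href values, and the extra "outside" and "top-level" links are made up for illustration; only the nesting follows the question. It assumes bs4 is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the nesting described in the question.
# The href values are placeholders invented for this example.
html = """
<body>
<a href="/outside">Outside link</a>
<table>
  <tbody>
    <tr><td><a href="/top">Top-level link</a></td></tr>
    <tr>
      <td>
        <table>
          <tr><td>no links here</td></tr>
          <tr>
            <td>
              <a href="/one">Some link</a>
              <a href="/two">Some link</a>
              <a href="/three">Some link</a>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </tbody>
</table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Spaces in a CSS selector mean "descendant of", so this matches only
# <a> tags that sit in a table row nested inside another table's row.
links = soup.select("table tr table tr td a")

# Pull the href attribute out of each matched tag.
hrefs = [a["href"] for a in links]
print(hrefs)  # ['/one', '/two', '/three']
```

The links outside the nested table are skipped because their ancestors do not include two levels of table and tr. Since spaces match descendants at any depth, the selector would also pick up links in more deeply nested tables. If that turns out to be too broad on the real page, the CSS child combinator (>) is one way to tighten it.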
Answered Oct 24 '22 04:10 by Pavel Anossov