I've looked at many examples on websites and on Stack Overflow, but I couldn't find a universal solution to my question. I'm dealing with a really messy website and I'd like to scrape some data from it. The markup looks like this:
...
<body>
...
<table>
<tbody>
<tr>
...
</tr>
<tr>
<td>
...
</td>
<td>
<table>
<tr>
...
</tr>
<tr>
<td>
<a href="...">Some link</a>
<a href="...">Some link</a>
<a href="...">Some link</a>
</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
</body>
The issue I'm having is that none of the elements have attributes that I can select on to narrow down the scope. Inside each of the "..." there may be similar markup, such as more <a>s, <table>s, and whatnot.
I know that table tr table tr td a is unique to the links I need, but how would BeautifulSoup grab those? I'm not sure how to grab nested tags without writing a separate line of code for each level.
Any help?
You can use CSS selectors with select():
soup.select('table tr table tr td a')
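For example, against a live page (Python 3, assuming bs4 and urllib.request are already imported; the output is as captured at the time):
In [32]: bs4.BeautifulSoup(urllib.request.urlopen('http://google.com/?hl=en').read(), 'html.parser').select('#footer a')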
Out[32]:
[<a href="/intl/en/ads/">Advertising Programs</a>,
<a href="/services/">Business Solutions</a>,
<a href="https://plus.google.com/116899029375914044550" rel="publisher">+Google</a>,
<a href="/intl/en/about.html">About Google</a>,
<a href="http://www.google.com/setprefdomain?prefdom=RU&prev=http://www.google.ru/&sig=0_3F2sRGWVktTCOFLA955Vr-AWlHo%3D">Google.ru</a>,
<a href="/intl/en/policies/">Privacy & Terms</a>]
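To see the selector working on markup shaped like the one in the question, here is a minimal sketch. The HTML string is a hypothetical stand-in: the href values, the extra "outside" and "top-level" links, and the variable names are made up for illustration, and the parser choice (html.parser) is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page in the question; hrefs are invented.
html = """
<body>
<a href="/outside">Outside link</a>
<table>
<tbody>
<tr><td><a href="/top">Top-level link</a></td></tr>
<tr>
<td>...</td>
<td>
<table>
<tr><td>...</td></tr>
<tr>
<td>
<a href="/a">Some link</a>
<a href="/b">Some link</a>
<a href="/c">Some link</a>
</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: only <a> tags that sit inside a table nested in another table's row.
links = soup.select("table tr table tr td a")
hrefs = [a["href"] for a in links]
print(hrefs)  # ['/a', '/b', '/c']
```

The spaces in the selector are descendant combinators, so intermediate tags such as <tbody> don't need to be listed. The links outside the nested table are skipped because they don't have two table ancestors.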