I am using BeautifulSoup for a project. Here is my HTML structure
<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
Now I want to grab text in div class 'apple' which falls under class 'fruits'
This is what I have tried so far ....
for node in soup.find_all("div", class_="apple")
Its returning ...
But I want it to return only ...
Please note that I DO NOT know the exact structure of elements inside div class="apple" There can be any type of different HTML elements inside that class. So the selector has to be flexible enough.
Here is the full code, where I need to add this BeautifulSoup code ...
class MySpider(CrawlSpider):
name = 'dknnews'
start_urls = ['http://www.example.com/uat-area/scrapy/all-news-listing/_recache']
allowed_domains = ['example.com']
def parse(self, response):
hxs = Selector(response)
soup = BeautifulSoup(response.body, 'lxml')
#soup = BeautifulSoup(content.decode('utf-8','ignore'))
nf = NewsFields()
ptype = soup.find_all(attrs={"name":"dknpagetype"})
ptitle = soup.find_all(attrs={"name":"dknpagetitle"})
pturl = soup.find_all(attrs={"name":"dknpageurl"})
ptdate = soup.find_all(attrs={"name":"dknpagedate"})
ptdesc = soup.find_all(attrs={"name":"dknpagedescription"})
for node in soup.find_all("div", class_="apple"): <!-- THIS IS WHERE I NEED TO ADD THE BS CODE -->
ptbody = ''.join(node.find_all(text=True))
ptbody = ' '.join(ptbody.split())
nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
nf['bodytext'] = ptbody.encode('ascii', 'ignore')
yield nf
for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
yield Request(url, callback=self.parse)
I am not sure how to use nested selectors with BeautifulSoup find_all ?
Any help is very appreciated.
Thanks
soup.select('.fruits .apple p')
use CSSselector, it's very easy to express class.
soup.find(class_='fruits').find(class_="apple").find_all('p')
Or, you can use find()
to get the p
tag step by step
EDIT:
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
use strings
generator to get all the string under the div
tag, stripped_strings
will get rid of \n
in the results.
out:
['John', 'Sam', 'Bailey', 'Jack', 'Sour', 'Sweet', 'Salty', 'Fruits are good']
Full code:
from bs4 import BeautifulSoup
source_code = """<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
"""
soup = BeautifulSoup(source_code, 'lxml')
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With