<p>
<a name="533660373"></a>
<strong>Title: Point of Sale Threats Proliferate</strong><br />
<strong>Severity: Normal Severity</strong><br />
<strong>Published: Thursday, December 04, 2014 20:27</strong><br />
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
<em>Analysis: Emboldened by past success and media attention, threat actors ..</em>
<br />
</p>
This is a paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside tags using the .children and .string attributes, but I cannot get the text "Several new Point of Sale malware fa...", which sits directly inside the paragraph without any enclosing tag. I tried soup.p.text, .get_text(), and so on, but none of them worked.
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"
html = urllib.request.urlopen(url)
htmlParse = BeautifulSoup(html, 'html.parser')
for para in htmlParse.find_all("p"):
    print(para.get_text())
Use find_all() with text=True to find all text nodes and recursive=False to search only among direct children of the parent p tag:
from bs4 import BeautifulSoup
data = """
<p>
<a name="533660373"></a>
<strong>Title: Point of Sale Threats Proliferate</strong><br />
<strong>Severity: Normal Severity</strong><br />
<strong>Published: Thursday, December 04, 2014 20:27</strong><br />
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
<em>Analysis: Emboldened by past success and media attention, threat actors ..</em>
<br />
</p>
"""
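# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
# Newer bs4 releases also accept string=True in place of text=True.
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))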
Prints:
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
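Since the question mentions .children, the same result can probably also be reached by iterating over the direct children and keeping only the text nodes. The following is a minimal sketch, not the answer's own method: the helper name direct_text and the shortened sample HTML are made up for illustration, and it assumes the html.parser backend.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def direct_text(tag):
    """Join the non-empty text nodes that are direct children of `tag`.

    `direct_text` is a hypothetical helper, not part of the bs4 API.
    """
    parts = []
    for child in tag.children:
        # Keep plain text nodes; skip nested tags and HTML comments
        # (Comment is a subclass of NavigableString, so exclude it explicitly).
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = child.strip()
            if stripped:
                parts.append(stripped)
    return " ".join(parts)


# Shortened, made-up sample in the same shape as the paragraph in the question.
html = """
<p>
<strong>Title: Example</strong><br />
Loose text inside the paragraph.<br />
<em>Analysis: more text</em>
</p>
"""
soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.p)
print(result)
```

Unlike get_text(), which collects text from every descendant, this only looks one level down, so the contents of strong and em are left out.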