<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
This is a paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags using the .children and .string attributes, but I cannot isolate the text "Several new Point of Sale malware fa...", which sits directly inside the paragraph without a tag of its own. I tried soup.p.text, .get_text(), etc., but none of them returned just that line.
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"
html = urllib.request.urlopen(url)
htmlParse = BeautifulSoup(html, 'html.parser')
for para in htmlParse.find_all("p"):
    print(para.get_text())
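Use find_all() with text=True to match only text nodes and recursive=False to restrict the search to direct children of the parent p tag (in Beautiful Soup 4.4.0 and later the same argument is also accepted as string=True; text is the older name):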
from bs4 import BeautifulSoup
data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""
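# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(data, 'html.parser')
# Join only the text nodes that are direct children of <p>
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))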
Prints:
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
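The same idea can also be written as an explicit loop over the paragraph's direct children, which may be easier to adapt. This is only a sketch: the helper name direct_text and the shortened sample HTML are made up for illustration, and it additionally skips HTML comments, which the one-liner above does not do.

```python
from bs4 import BeautifulSoup, NavigableString, Comment


def direct_text(tag):
    """Return the text sitting directly inside `tag`, ignoring nested tags.

    `direct_text` is a hypothetical helper, not part of Beautiful Soup.
    """
    pieces = []
    for child in tag.children:
        # Bare text is a NavigableString; Comment is a subclass of it,
        # so comments are excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = child.strip()
            if stripped:  # drop whitespace-only nodes between tags
                pieces.append(stripped)
    return " ".join(pieces)


# Shortened, made-up sample with the same shape as the paragraph in the question
html = """<p>
    <strong>Title: Example</strong><br />
    Bare text inside the paragraph.<br />
    <em>Analysis: more text</em>
</p>"""

soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.p)
print(result)
```

Joining with a space instead of an empty string keeps separate text fragments from running together when a paragraph contains more than one untagged line.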