
Extract text inside HTML paragraph using BeautifulSoup in Python

<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>

This is a paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags using the .children and .string attributes, but I cannot get the text "Several new Point of Sale malware fa...", which sits directly inside the paragraph without a tag of its own. I tried soup.p.text, .get_text(), etc., but to no avail.

Remis Haroon - رامز asked Sep 16 '25 21:09


2 Answers

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"

html = urllib.request.urlopen(url)

htmlParse = BeautifulSoup(html, 'html.parser')

for para in htmlParse.find_all("p"):
    print(para.get_text())
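
For reference, a network-free sketch of the same get_text() call on a shortened, made-up version of the question's markup (the html string below is an assumption for illustration, not the original page). It suggests that get_text() does return the untagged text, but mixed in with the text of every child tag:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the paragraph in the question
html = "<p><strong>Title: Example</strong><br />Loose text.<br /><em>Analysis: note</em></p>"
soup = BeautifulSoup(html, "html.parser")

# get_text() collects the text of all descendants, not only the direct text nodes;
# separator and strip make the individual pieces easier to see
all_text = soup.p.get_text(separator="\n", strip=True)
print(all_text)
```

The untagged sentence is in the output, so isolating it needs a filter on direct children rather than a different text accessor.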
golden_boy answered Sep 18 '25 17:09


Use find_all() with text=True (named string=True in Beautiful Soup 4.4.0 and later) to match only text nodes, and recursive=False to restrict the search to the direct children of the parent p tag:

from bs4 import BeautifulSoup

data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

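# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
# string=True is the current name of the text=True argument (Beautiful Soup 4.4.0+)
print("".join(text.strip() for text in soup.p.find_all(string=True, recursive=False)))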

Prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
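
An alternative sketch of the same idea, assuming you prefer to walk .children yourself: keep only the children that are NavigableString instances, which are the text nodes sitting directly inside the tag. The helper name direct_text and the shortened html string are hypothetical, introduced here only for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def direct_text(tag):
    """Return text that sits directly inside `tag`, ignoring text in child tags."""
    pieces = []
    for child in tag.children:
        # Comment is a NavigableString subclass, so exclude it explicitly
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = child.strip()
            if stripped:  # skip whitespace-only nodes between tags
                pieces.append(stripped)
    return " ".join(pieces)


# Shortened, hypothetical stand-in for the paragraph in the question
html = """<p>
    <strong>Title: Example</strong><br />
    Loose text directly inside the paragraph.<br />
    <em>Analysis: more text</em>
</p>"""

soup = BeautifulSoup(html, "html.parser")
loose = direct_text(soup.p)
print(loose)
```

Joining with a space rather than an empty string keeps separate text nodes from running together when a paragraph contains more than one untagged fragment.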
alecxe answered Sep 18 '25 16:09