I have the following HTML that is within a larger document
<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />
I'm currently using BeautifulSoup to obtain other elements within the HTML, but I have not been able to find a way to get the important lines of text between <br />
tags. I can isolate and navigate to each of the <br />
elements, but can't find a way to get the text in between. Any help would be greatly appreciated. Thanks.
BeautifulSoup get text with <br> tags You can use get_text() with an undocumented separator parameter to get the text inside the div like so. Alternatively, you can replace every single <br> tag with an unique string of your choice, then once you get the output, replace that string back to newlines.
BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2.
Step-by-step Approach. Step 1: The first step will be for scraping we need to import beautifulsoup module and get the request of the website we need to import the requests module. Step 2: The second step will be to request the URL call get method.
The HTML content of the webpages can be parsed and scraped with Beautiful Soup.
If you just want any text which is between two <br />
tags, you could do something like the following:
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''
soup = BeautifulSoup(input)
for br in soup.findAll('br'):
next_s = br.nextSibling
if not (next_s and isinstance(next_s,NavigableString)):
continue
next2_s = next_s.nextSibling
if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
text = str(next_s).strip()
if text:
print "Found:", next_s
But perhaps I misunderstand your question? Your description of the problem doesn't seem to match up with the "important" / "non important" in your example data, so I've gone with the description ;)
So, for test purposes, let's assume that this chunk of HTML is inside a span
tag:
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
Now I'm going to parse it and find my span tag:
from BeautifulSoup import BeautifulSoup
y = soup.find('span')
If you iterate over the generator in y.childGenerator()
, you will get both the br's and the text:
In [4]: for a in y.childGenerator(): print type(a), str(a)
....:
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 1
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Not Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 2
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 3
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Non Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 4
<type 'instance'> <br />
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With