Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)

Tags:

I have the following HTML that is within a larger document

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

I'm currently using BeautifulSoup to obtain other elements within the HTML, but I have not been able to find a way to get the important lines of text between <br /> tags. I can isolate and navigate to each of the <br /> elements, but can't find a way to get the text in between. Any help would be greatly appreciated. Thanks.

like image 215
maltman Avatar asked Mar 11 '11 16:03

maltman


People also ask

How do you get text with the BR tag in BeautifulSoup?

BeautifulSoup get text with <br> tags You can use get_text() with an undocumented separator parameter to get the text inside the div like so. Alternatively, you can replace every single <br> tag with an unique string of your choice, then once you get the output, replace that string back to newlines.

Can BeautifulSoup handle broken HTML?

BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2.

How do you scrape a tag with BeautifulSoup?

Step-by-step Approach. Step 1: The first step will be for scraping we need to import beautifulsoup module and get the request of the website we need to import the requests module. Step 2: The second step will be to request the URL call get method.

Can BeautifulSoup parse HTML?

The HTML content of the webpages can be parsed and scraped with Beautiful Soup.


2 Answers

If you just want any text which is between two <br /> tags, you could do something like the following:

from BeautifulSoup import BeautifulSoup, NavigableString, Tag

input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(input)

for br in soup.findAll('br'):
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s

But perhaps I misunderstand your question? Your description of the problem doesn't seem to match up with the "important" / "non important" in your example data, so I've gone with the description ;)

like image 81
Mark Longair Avatar answered Sep 18 '22 14:09

Mark Longair


So, for test purposes, let's assume that this chunk of HTML is inside a span tag:

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

Now I'm going to parse it and find my span tag:

from BeautifulSoup import BeautifulSoup
y = soup.find('span')

If you iterate over the generator in y.childGenerator(), you will get both the br's and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a)
   ....: 
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 1

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Not Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 2

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 3

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Non Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 4

<type 'instance'> <br />
like image 30
Ken Kinder Avatar answered Sep 19 '22 14:09

Ken Kinder