
Suggestions on get_text() in BeautifulSoup

I am using BeautifulSoup to parse some content from an HTML page.

I can extract the content I want from the HTML (i.e. the text contained in a span with the class myclass):

result = mycontent.find(attrs={'class':'myclass'})

I obtain this result:

<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

If I try to extract the text using:

result.get_text()

I obtain:

Lorem ipsumdolor sit amet,consectetur...

As you can see, when the <br> tags are removed there is no spacing left between the pieces of text, so the words on either side of each tag are concatenated.

How can I solve this issue?

user601836 asked Apr 20 '13 13:04



1 Answer

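If you are using bs4, you can iterate over the tag's strings generator, which yields each text fragment separately, and join the fragments with a space: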

" ".join(result.strings)
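
For completeness, here is a minimal self-contained sketch of that approach applied to the markup from the question. It assumes the built-in html.parser backend and builds the soup from an inline string (the question does not show how mycontent was created, so the soup variable here is only a stand-in). It also shows get_text() with its separator argument, which should give the same result in this case.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; parsing it from an inline string is an
# assumption made only to keep the example self-contained.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")

result = soup.find(attrs={"class": "myclass"})

# .strings yields each text fragment between the <br/> tags separately,
# so joining them with a space restores the missing whitespace.
joined = " ".join(result.strings)

# get_text() accepts a separator that is placed between the fragments.
with_separator = result.get_text(separator=" ")

print(joined)
print(with_separator)
```

If the fragments themselves carry stray leading or trailing whitespace, result.stripped_strings or get_text(separator=" ", strip=True) may be a better fit.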
Sean Vieira answered Sep 25 '22 11:09