I am using BeautifulSoup to parse some content from an HTML page.
I can extract the content I want from the HTML (i.e. the text contained in a span with the class myclass).
result = mycontent.find(attrs={'class':'myclass'})
I obtain this result:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>
If I try to extract the text using:
result.get_text()
I obtain:
Lorem ipsumdolor sit amet,consectetur...
As you can see, when the <br> tags are removed there is no spacing left between the pieces of text, so adjacent words are concatenated.
How can I solve this issue?
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
If you are using bs4, you can iterate over the tag's .strings generator and join the pieces with a space:
" ".join(result.strings)
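For reference, here is a minimal, self-contained sketch of that approach applied to the span from the question. The variable names `html`, `soup`, `joined` and `with_separator` are just illustrative, and the built-in `html.parser` backend is an assumption (the question does not say which parser is used). It also shows `get_text(separator=" ")`, which should give the same result for this input.

```python
from bs4 import BeautifulSoup

# The markup from the question, inlined here so the example is self-contained.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

# Assumption: the standard-library parser; lxml or html5lib should behave
# the same way for this small fragment.
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# .strings yields each text node separately, so joining with a space
# restores the gap where each <br/> used to be.
joined = " ".join(result.strings)

# get_text() accepts a separator that is placed between text nodes.
with_separator = result.get_text(separator=" ")

print(joined)
print(with_separator)
```

If the real page has extra whitespace or newlines around the text nodes, `result.stripped_strings` or `get_text(separator=" ", strip=True)` may give cleaner output.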