I am working on web scraping and want just the text from any website, so I am using Beautiful Soup. Initially I found that the get_text() method was also returning JavaScript code. To avoid that, I read that I should use the extract() method, but now I have a weird problem: after extracting the script and style tags, Beautiful Soup doesn't recognize the body, even though it is present in the new HTML.
To make this clear, first I was doing this:
soup = BeautifulSoup(HTMLRawData, 'html.parser')
print(soup.body)
Here the print statement printed all the HTML data,
but when I do
soup = BeautifulSoup(rawData, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()  # rip it out
print(soup.body)
it now prints None, as if the element were not present. For debugging I then called soup.prettify(), and it printed the whole HTML including the body tag, with no script or style tags. Now I am very confused about why this is happening: if the body is present, why does soup.body return None? Please help, thanks.
I am using Python 3 and bs4, and rawData is HTML extracted from a website.
Problem: Using this HTML example:
<html>
<style>just style</style>
<span>Main text.</span>
</html>
After extracting the style tag, calling get_text() returns only the text that was supposed to be removed. This is due to a double newline left in the HTML after extract(). Inspect soup.contents before and after .extract() and you will see the issue.
Before extract():
[<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>]
After extract():
[<html>\n\n<span>Main text.</span>\n</html>]
You can see the double newline between html and span. This breaks get_text() for some unknown reason. To verify this, remove the newlines from the example and it will work properly.
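The before/after check described above can be reproduced with a short script. This is only a sketch: the variable names (raw, before, after) are made up for illustration, and the exact get_text() behaviour you observe may depend on the bs4 version installed.

```python
from bs4 import BeautifulSoup

# The sample HTML from this answer, newlines included.
raw = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"

soup = BeautifulSoup(raw, "html.parser")
before = repr(soup.contents)  # snapshot of the tree before removal

# Remove every <script> and <style> tag from the tree.
for tag in soup(["script", "style"]):
    tag.extract()

after = repr(soup.contents)  # snapshot of the tree after removal

print("Before extract():", before)
print("After extract(): ", after)
```

Comparing the two printed lists shows where the style tag used to be and which newline strings are left next to each other.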
Solutions:
1.- Parse the soup again after the extract() call.
BeautifulSoup(str(soup), 'html.parser')
2.- Use a different parser.
BeautifulSoup(raw, 'html5lib')
Note: Solution #2 doesn't work if you extract two or more contiguous tags, because you end up with a double newline again.
Note: You will probably have to install this parser. Just run:
pip install html5lib
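Solution 1 can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: the function name visible_text and its parser argument are hypothetical, and the separator and strip arguments to get_text() are an extra choice made here to tidy the whitespace.

```python
from bs4 import BeautifulSoup


def visible_text(raw_html, parser="html.parser"):
    """Return the text of raw_html without script/style content."""
    soup = BeautifulSoup(raw_html, parser)

    # Remove the tags whose contents should not appear in the text.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Solution 1: serialize the modified tree and parse it again,
    # so get_text() runs on a freshly built tree.
    clean = BeautifulSoup(str(soup), parser)

    # strip=True drops whitespace-only strings; separator joins the rest.
    return clean.get_text(separator=" ", strip=True)


sample = "<html>\n<style>just style</style>\n<span>Main text.</span>\n</html>"
text = visible_text(sample)
print(text)
```

To try solution 2 instead, pass parser="html5lib" after installing that package.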