Adding a space between paragraphs when extracting text with BeautifulSoup

I need to extract the useful text from news articles. I do it with BeautifulSoup, but the output runs some paragraphs together, which prevents me from analysing the text further.

My code:

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/uk-england-39607452")
soup = BeautifulSoup(r.content, "lxml")

# delete unwanted tags:
for s in soup(['figure', 'script', 'style']):
    s.decompose()

article_soup = [s.get_text() for s in soup.find_all(
                'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)

The output looks like this (just first 5 sentences):

The family of British student Hannah Bladon, who was stabbed to death in Jerusalem, have said they are "devastated" by the "senseless and tragic attack".Ms Bladon, 20, was attacked on a tram in Jerusalem on Good Friday.She was studying at the Hebrew University of Jerusalem at the time of her death and had been taking part in an archaeological dig that morning.Ms Bladon was stabbed several times in the chest and died in hospital. She was attacked by a man who pulled a knife from his bag and repeatedly stabbed her on the tram travelling near Old City, which was busy as Christians marked Good Friday and Jews celebrated Passover.

I tried adding a space after certain punctuation marks, such as ".", "?", and "!":

article = article.replace(".", ". ")

This works for paragraphs (although I believe there should be a smarter way of doing it), but not for the subtitles of the article's sections, which have no punctuation at the end. They are structured like this:

</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>

I will be grateful for your advice.

PS: joining article_soup with a space instead of an empty string doesn't help.

asked Dec 11 '22 11:12 by aviss


1 Answer

You can pass the separator argument to get_text. It then returns all the strings inside the current element joined by the given separator, and strip=True trims the whitespace around each string first.

article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all(
                'div', {'class': 'story-body__inner'})]
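
To see what the separator changes, here is a minimal sketch on a small hand-written HTML string. The markup is hypothetical and only imitates the structure described in the question, it is not the real BBC page, and the variable names (html, body, plain, separated, blocks) are made up for illustration. It also shows one possible side effect: the separator is inserted between every string, so a paragraph containing an inline tag such as a link is split too. Collecting the text tag by tag is one way around that, assuming the article consists of p and h2 elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the question's structure (no whitespace between tags).
html = (
    '<div class="story-body__inner">'
    '<p>First paragraph.</p>'
    '<p>Second paragraph with a <a href="#">link</a> inside.</p>'
    '<h2 class="story-body__crosshead">Subtitle text</h2>'
    '<p>Third paragraph.</p>'
    '</div>'
)

# The built-in parser is used here so the sketch does not depend on lxml.
soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", {"class": "story-body__inner"})

# Default call: all strings are concatenated with nothing in between.
plain = body.get_text()

# With a separator: every string ends up on its own line,
# including the pieces around the inline <a> tag.
separated = body.get_text(separator="\n", strip=True)

# Alternative sketch: one entry per block-level tag, with inline pieces
# re-joined by a single space (assumes only <p> and <h2> carry article text).
blocks = [tag.get_text(" ", strip=True) for tag in body.find_all(["p", "h2"])]
article = "\n".join(blocks)

print(plain)
print("---")
print(separated)
print("---")
print(article)
```

Which variant fits better depends on the page: the separator version is shorter, while the per-tag version keeps each paragraph on a single line.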

answered Apr 26 '23 21:04 by Zroq