
BeautifulSoup: extract a sentence if it contains a keyword

I would like to process an HTML page (e.g. this one: http://www.uni-bremen.de/mscmarbiol/) and save each sentence that contains the string 'research'.

This is just an example of the code I used to pull all the text from the page.

from bs4 import BeautifulSoup

html_page = "example.html"  # the page, saved locally as example.html

with open(html_page, "r") as html:
    # parse with the lxml parser and flatten the whole document to plain text
    soup = BeautifulSoup(html, "lxml")
    text_group = soup.get_text()

print(text_group)

What would be the best way to export only the sentences that contain the word 'research'?

Is there a more elegant way than using .split with separators on the string? Can something be done with "re"?

Thank you very much for your help as I am very much new to this topic.

Best regards,

Trgovec

Trgovec asked Oct 14 '25 04:10


2 Answers

Considering "sentences" aren't strictly defined in the document, it sounds like you will need to use a tool that splits plaintext into sentences.

The NLTK package is great for this kind of thing. You will want to do something like

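import nltk
# sent_tokenize needs the Punkt data; run nltk.download("punkt") once
# (newer NLTK releases ask for "punkt_tab" instead)
sentences = nltk.sent_tokenize(text_group)  # text_group: the soup.get_text() result from the question
result = [sentence for sentence in sentences if "research" in sentence]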

It's not perfect (for instance, it doesn't understand that "The M.Sc." in your document is not a separate sentence), but sentence segmentation is a deceptively complex task, and this is about as good as you'll get out of the box.
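Since the question also asks about "re": a rough sketch of a regex-only version, in case installing NLTK is not an option. The helper names split_sentences and sentences_with and the sample text are made up for illustration. The splitter assumes a sentence ends at ., ! or ? followed by whitespace and a capital letter or quote, so it will still split wrongly after an abbreviation that is followed by a capitalised word. Matching the keyword as a whole word and ignoring case is a choice made here, not something the question requires.

```python
import re

def split_sentences(text):
    # Collapse runs of whitespace, then split after ., ! or ? when the next
    # non-space character is an uppercase letter or a quote.
    text = " ".join(text.split())
    return re.split(r'(?<=[.!?])\s+(?=[A-Z"\'])', text)

def sentences_with(text, keyword):
    # \b keeps "research" from matching inside "researcher";
    # IGNORECASE also catches a sentence-initial "Research".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in split_sentences(text) if pattern.search(s)]

# Hypothetical sample text, not taken from the page in the question
sample = ("The M.Sc. programme is taught in English. "
          "Research is a core part of the second year. "
          "Students may also apply for funding.")

print(sentences_with(sample, "research"))
```

For a plain substring match, as in the list comprehension above, the compiled pattern can be swapped for `keyword in s`.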

Denziloe answered Oct 16 '25 18:10


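Once you have your soup, you may try the following, which prints every text node that contains the keyword:

for text in soup.strings:
    # soup.strings yields each text node once; looping over soup.descendants
    # and testing tag.string reports a match for both a tag and the text inside it
    if 'research' in text:
        print(text)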

Faster alternative using XPath, since you have lxml installed:

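from lxml import etree
# pass the path directly so lxml deals with the file's encoding itself
tree = etree.parse(html_page, parser=etree.HTMLParser())
# //text()[...] tests every text node; contains(text(), ...) only looks at
# the first text node of each element
print(tree.xpath("//text()[contains(., 'research')]"))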
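
Both snippets above give you the text of the matching nodes, not individual sentences, and a sentence that runs across an inline tag such as <b> ends up in separate nodes. A possible way to combine the two ideas is sketched below: take the text of each block-level element, then split it into sentences and filter. SAMPLE_HTML, keyword_sentences and the list of block tags are assumptions made for this example. The built-in "html.parser" is used so the sketch runs without lxml, and the sentence split is a naive regex rather than a real tokenizer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the saved page
SAMPLE_HTML = """
<html><body>
  <h1>Marine Biology</h1>
  <p>The programme is taught in English. Students join a <b>research</b> group early on.</p>
  <p>Applications close in spring.</p>
</body></html>
"""

def keyword_sentences(html, keyword="research"):
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    # Assumed set of block-level tags; nested blocks (e.g. <p> inside <li>)
    # would be visited twice and are not handled here.
    for block in soup.find_all(["p", "li", "h1", "h2", "h3"]):
        # get_text() joins the text across inline tags, so the sentence is
        # not cut at the <b> boundary; then normalise whitespace.
        text = " ".join(block.get_text().split())
        # Naive split: after ., ! or ? followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if keyword in sentence:
                hits.append(sentence)
    return hits

print(keyword_sentences(SAMPLE_HTML))
```

To run it on the saved page, read the file and pass its contents instead of SAMPLE_HTML.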

Guillaume answered Oct 16 '25 18:10