I would like to process an HTML page (e.g. this one: http://www.uni-bremen.de/mscmarbiol/) and save every sentence that contains the string 'research'.
This is just an example of the code I used to pull all the text from the website:
from bs4 import BeautifulSoup

html_page = "example.html"  # I saved this page locally as example.html

with open(html_page, "r") as html:
    soup = BeautifulSoup(html, "lxml")

text_group = soup.get_text()
print(text_group)
What would be the best way to export only the sentences that contain the word 'research'?
Is there a more elegant way than using .split and separators on a string? Can something be done with re?
Thank you very much for your help, as I am very new to this topic.
Best regards,
Trgovec
Considering "sentences" aren't strictly defined in the document, it sounds like you will need to use a tool that splits plaintext into sentences.
The NLTK package is great for this kind of thing. You will want to do something like
import nltk
sentences = nltk.sent_tokenize(text)
result = [sentence for sentence in sentences if "research" in sentence]
It's not perfect (it doesn't realize, for instance, that "The M.Sc." in your document does not end a sentence), but sentence segmentation is a deceptively hard problem, and this is about as good as it gets.
Once you have your soup, you may try:
for tag in soup.descendants:
    if tag.string and 'research' in tag.string:
        print(tag.string)
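One caveat: tag.string is None whenever a tag has more than one child, so this approach walks individual text nodes rather than full sentences. A variant that iterates over soup.strings makes that explicit (a sketch with a made-up HTML snippet):

```python
from bs4 import BeautifulSoup

html = "<div><p>Our research is <b>marine</b> biology.</p><p>research news</p></div>"
soup = BeautifulSoup(html, "lxml")

# soup.strings yields every NavigableString; a sentence split across tags
# (like the first <p>) shows up as separate fragments, not one string.
hits = [s for s in soup.strings if "research" in s]
print(hits)
```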
Faster alternative using XPath, since you have lxml installed:
from lxml import etree

with open(html_page, "r") as html:
    tree = etree.parse(html, parser=etree.HTMLParser())

[e.text for e in tree.xpath("//*[contains(text(), 'research')]")]
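Since the question also asked about re: a naive regex split on sentence-ending punctuation works for rough-and-ready cases, though it mishandles abbreviations like "M.Sc." exactly where NLTK does better. A sketch on a stand-in string:

```python
import re

text = "We focus on research. Teaching matters too! What about research funding?"

# Split after '.', '!' or '?' followed by whitespace -- naive, breaks on abbreviations.
sentences = re.split(r"(?<=[.!?])\s+", text)
matches = [s for s in sentences if "research" in s]
print(matches)
```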