How do you go about parsing an HTML page with free text, lists, tables, headings, etc., into sentences?
Take this Wikipedia page, for example. It contains free text, lists, tables, and headings all mixed together.
After messing around with the Python NLTK, I want to test out all of the different corpus annotation methods described at http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include.
Once you break a document into sentences it seems pretty straightforward. But how do you go about breaking down something like the HTML from that Wikipedia page? I am very familiar with using HTML/XML parsers and traversing the tree, and I have tried just stripping the HTML tags to get the plain text, but because punctuation is missing after the HTML is removed, NLTK doesn't parse things like table cells, or even lists, correctly.
Is there some best-practice or strategy for parsing that stuff with NLP? Or do you just have to manually write a parser specific to that individual page?
Just looking for some pointers in the right direction, really want to try this NLTK out!
Sounds like you're stripping all HTML and generating a flat document, which confuses the parser because the loose pieces are stuck together. Since you are experienced with XML, I suggest mapping your inputs to a simple XML structure that keeps the pieces separate. You can make it as simple as you want, but you will probably want to retain some information: for example, it may be useful to flag titles, section headings, etc. as such. When you've got a workable XML tree that keeps the chunks separate, use XMLCorpusReader to import it into the NLTK universe.
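A minimal sketch of that last step, assuming the directory layout and element names below (they are invented for illustration; XMLCorpusReader is NLTK's generic reader for XML corpora):

from nltk.corpus.reader import XMLCorpusReader

# Assume corpus/page.xml holds a simplified structure such as:
# <document>
#   <title>...</title>
#   <section><heading>...</heading><para>...</para></section>
# </document>
reader = XMLCorpusReader('corpus/', 'page.xml')

words = reader.words('page.xml')  # flat list of word tokens from the XML text
tree = reader.xml('page.xml')     # the ElementTree root, if you need the structure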
I had to write rules specific to the XML docs I was analyzing.
What I did was build a mapping of HTML tags to segments. The mapping was based on studying several docs/pages and determining what the HTML tags represent. E.g., <h1> is a phrase segment; <li> is a paragraph; <td> is a token.
If you want to work with XML, you can represent the new mappings as tags. E.g., <h1> to <phrase>; <li> to <paragraph>; <td> to <token>.
If you want to work on plain text, you can represent the mappings as pairs of markers (e.g., [PHRASESTART] ... [PHRASEEND]), just like POS or EOS labeling; see the sketch below.
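A minimal sketch of the plain-text variant, using lxml and the example mapping above (TAG_MAP and mark_segments are names I made up for this illustration; adapt the mapping to your own documents):

from lxml import html

# Hypothetical mapping from HTML tags to segment names
TAG_MAP = {
    'h1': 'PHRASE',
    'li': 'PARAGRAPH',
    'td': 'TOKEN',
}

def mark_segments(page_source):
    """Wrap the text of each mapped tag in [XSTART] ... [XEND] markers."""
    tree = html.fromstring(page_source)
    for tag, segment in TAG_MAP.items():
        for element in tree.iter(tag):
            text = element.text_content().strip()
            if text:
                yield '[{0}START] {1} [{0}END]'.format(segment, text)

with open('page.html') as f:
    for line in mark_segments(f.read()):
        print(line)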
You can use tools like python-goose, which aims at extracting articles from HTML pages.
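Usage looks roughly like this (a sketch from memory of the python-goose API; double-check against its README):

from goose import Goose

g = Goose()
article = g.extract(url='https://en.wikipedia.org/wiki/Python_(programming_language)')
print(article.title)
print(article.cleaned_text)  # article body as plain text, ready for sentence splitting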
Otherwise, here is a small program I wrote that gives reasonably good results:
from html5lib import parse

with open('page.html') as f:
    doc = parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]

def sanitize(element):
    """Retrieve all the text contained in an element as a single line of
    text. This must be executed only on blocks that have only inlines
    as children.
    """
    # join all the strings and remove \n
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # replace multiple spaces with a single space
    out = ' '.join(out.split())
    return out

def extract(element):
    # these elements can contain other blocks inside them
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            for child in element:
                yield from extract(child)
        else:
            yield sanitize(element)
    # these elements are "guaranteed" to contain only inlines
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        print('> ignored', element.tag)

# keep only blocks long enough to look like real sentences
for e in filter(lambda x: len(x) > 80, extract(body)):
    print(e)
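From there, feeding each extracted block to a sentence tokenizer is straightforward. A short sketch, assuming NLTK and its punkt model are installed (nltk.download('punkt')):

from nltk import sent_tokenize

for block in filter(lambda x: len(x) > 80, extract(body)):
    for sentence in sent_tokenize(block):
        print(sentence)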