
Parsing HTML into sentences - how to handle tables/lists/headings/etc?

How do you go about parsing an HTML page with free text, lists, tables, headings, etc., into sentences?

Take this Wikipedia page, for example. It contains:

  • free text: http://en.wikipedia.org/wiki/Neurotransmitter#Discovery
  • lists: http://en.wikipedia.org/wiki/Neurotransmitter#Actions
  • tables: http://en.wikipedia.org/wiki/Neurotransmitter#Common_neurotransmitters

After messing around with the Python NLTK, I want to test out all of these different corpus annotation methods (from http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include):

  • Word Tokenization: The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
  • Sentence Segmentation: As we saw in Chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.
  • Paragraph Segmentation: Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
  • Part of Speech: The syntactic category of each word in a document.
  • Syntactic Structure: A tree structure showing the constituent structure of a sentence.
  • Shallow Semantics: Named entity and coreference annotations, semantic role labels.
  • Dialogue and Discourse: dialogue act tags, rhetorical structure
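
For reference, NLTK already covers a few of these layers out of the box. A quick sketch (assuming the required tokenizer and tagger models have been fetched with nltk.download()):

import nltk

text = "Neurotransmitters are endogenous chemicals. They enable neurotransmission."

sents = nltk.sent_tokenize(text)      # sentence segmentation
words = nltk.word_tokenize(sents[0])  # word tokenization
print(nltk.pos_tag(words))            # part of speech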

Once you break a document into sentences it seems pretty straightforward. But how do you go about breaking down something like the HTML from that Wikipedia page? I am very familiar with using HTML/XML parsers and traversing the tree, and I have tried just stripping the HTML tags to get the plain text, but because punctuation is missing after HTML is removed, NLTK doesn't parse things like table cells, or even lists, correctly.
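
For example, stripping the tags from a small table shows the problem (an illustrative snippet, assuming BeautifulSoup and NLTK's punkt data):

from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

html = '<table><tr><td>Dopamine</td><td>reward, motivation</td></tr></table>'
flat = BeautifulSoup(html, 'html.parser').get_text()
print(repr(flat))           # 'Dopaminereward, motivation' -- the cell boundary is lost
print(sent_tokenize(flat))  # one run-on chunk instead of two separate cells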

Is there some best practice or strategy for parsing that stuff with NLP? Or do you just have to hand-write a parser specific to each individual page?

Just looking for some pointers in the right direction; I really want to try NLTK out!

asked Jun 30 '12 by Lance

3 Answers

Sounds like you're stripping all the HTML and generating a flat document, which confuses the parser since the loose pieces are stuck together. Since you are experienced with XML, I suggest mapping your inputs to a simple XML structure that keeps the pieces separate. You can make it as simple as you want, but perhaps you'll want to retain some information. E.g., it may be useful to flag titles, section headings, etc. as such. When you've got a workable XML tree that keeps the chunks separate, use XMLCorpusReader to import it into the NLTK universe.
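
A minimal sketch of that approach, assuming BeautifulSoup, lxml, and NLTK are installed (the <document>, <heading>, and <chunk> element names are arbitrary choices, not anything NLTK requires, and a real version would also deal with nested matches):

from bs4 import BeautifulSoup
from lxml import etree
from nltk.corpus.reader import XMLCorpusReader

with open('page.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Keep each structural piece as its own element so the chunks stay separate.
root = etree.Element('document')
for node in soup.find_all(['h1', 'h2', 'h3', 'p', 'li', 'td']):
    text = node.get_text(' ', strip=True)
    if not text:
        continue
    tag = 'heading' if node.name.startswith('h') else 'chunk'
    etree.SubElement(root, tag).text = text

etree.ElementTree(root).write('page.xml', encoding='utf-8')

# XMLCorpusReader exposes the XML's text content as NLTK tokens.
reader = XMLCorpusReader('.', ['page.xml'])
print(reader.words('page.xml')[:20])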

answered by alexis

I had to write rules specific to the XML docs I was analyzing.

What I did was build a mapping of HTML tags to segments. This mapping was based on studying several docs/pages and determining what the HTML tags represent. E.g., <h1> is a phrase segment; <li> is a paragraph; <td> is a token.

If you want to work with XML, you can represent the new mappings as tags. E.g., <h1> becomes <phrase>; <li> becomes <paragraph>; <td> becomes <token>.

If you want to work on plain text, you can represent the mappings as pairs of marker strings (e.g. [PHRASESTART] ... [PHRASEEND]), just like POS or EOS labeling.
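
A rough sketch of the plain-text variant, assuming lxml is available (the mapping and the marker strings are just illustrative; base yours on your own documents):

from lxml import html

# Illustrative mapping of HTML tags to segment markers.
SEGMENT_MARKERS = {
    'h1': ('[PHRASESTART]', '[PHRASEEND]'),
    'li': ('[PARASTART]', '[PARAEND]'),
    'td': ('[TOKENSTART]', '[TOKENEND]'),
}

doc = html.parse('page.html')
# Walk the tree in document order and wrap each mapped tag's text.
for node in doc.getroot().iter():
    markers = SEGMENT_MARKERS.get(node.tag)
    if markers:
        text = node.text_content().strip()
        if text:
            print(markers[0], text, markers[1])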

answered by ezio808


You can use tools like python-goose, which aims to extract articles from HTML pages.
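
A minimal usage sketch, assuming python-goose is installed (the extract/cleaned_text calls follow its README):

from goose import Goose

g = Goose()
article = g.extract(url='http://en.wikipedia.org/wiki/Neurotransmitter')
print(article.title)
print(article.cleaned_text[:300])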

Otherwise, here is a small program I wrote that gives reasonably good results:

from html5lib import parse


with open('page.html') as f:
    doc = parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]


def sanitize(element):
    """Retrieve all the text contained in an element as a single line of
    text. This must be executed only on blocks whose children are all
    inline elements.
    """
    # Join all the text nodes and remove newlines.
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # Collapse runs of whitespace into a single space.
    out = ' '.join(out.split())
    return out


def extract(element):
    # These elements can contain other blocks inside them.
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            # No direct text of its own: recurse into the children.
            for child in element:
                yield from extract(child)
        else:
            yield sanitize(element)
    # These elements are "guaranteed" to contain only inlines.
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        try:
            print('> ignored', element.tag)
        except UnicodeEncodeError:
            # Some tag names may not be printable on every console.
            pass


# Keep only reasonably long chunks of text.
for e in filter(lambda x: len(x) > 80, extract(body)):
    print(e)
answered by amirouche