
HTML parser in Python [closed]

Tags: python, import

Using the Python documentation I found the HTML parser, but I have no idea which library to import in order to use it. How do I find this out (bearing in mind that it doesn't say on the page)?

Teifion asked Sep 16 '08 10:09


People also ask

What class does Python provide to parse HTML?

The HTMLParser class defined in this module provides functionality to parse HTML and XHTML documents. The class has handler methods that are called for tags, data, comments and other HTML elements.

Which methods are available through HTML parser?

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered.

What does the HTML parser do in Beautiful Soup?

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name comes from "tag soup"). It builds a parse tree for each parsed page that can be used to extract data from HTML, which is useful for web scraping.


4 Answers

You probably really want BeautifulSoup; check the link for an example.

But in any case, in Python 2:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.feed('<html></html>')
>>> h.get_starttag_text()
'<html>'
>>> h.close()
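The session above uses the Python 2 module name. As a rough sketch of the same idea on Python 3, where the class lives in html.parser, the parser is usually subclassed so that its handler methods do something with each tag. The EventCollector class and the sample markup here are made up for illustration; only HTMLParser and its handle_* methods come from the standard library.

```python
from html.parser import HTMLParser  # Python 3 name of the module


class EventCollector(HTMLParser):
    """Hypothetical helper: records each parse event as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


collector = EventCollector()
collector.feed('<html><a href="/docs">Docs</a></html>')
collector.close()

for event in collector.events:
    print(event)
```

The base class on its own ignores everything it parses, so overriding the handlers is what makes it useful.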
Vinko Vrsalovic answered Oct 05 '22 23:10


Try:

import HTMLParser

In Python 3.0 the HTMLParser module was renamed to html.parser; the Python 2 documentation for HTMLParser mentions the rename.

Python 3.0

import html.parser

Python 2.2 and above

import HTMLParser
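If the same script has to run on both versions, one common pattern is to try the Python 3 import first and fall back to the Python 2 name. This is only a sketch; the module_name variable is illustrative. Reading the class's __module__ attribute is also one way to check which module a class actually comes from.

```python
# Try the Python 3 location first, then fall back to the Python 2 module name.
try:
    from html.parser import HTMLParser
except ImportError:  # Python 2
    from HTMLParser import HTMLParser

# A class records the name of the module it was defined in.
module_name = HTMLParser.__module__
print(module_name)
```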
1077 answered Oct 05 '22 23:10


I would recommend using the Beautiful Soup module instead; it has good documentation.

Swaroop C H answered Oct 06 '22 01:10


You may be interested in lxml. It is a separate package and has C components, but it is the fastest of these options. It also has a very nice API that lets you easily list the links or forms in an HTML document, sanitize HTML, and more. It can also parse HTML that is not well-formed (this is configurable).

Paweł Hajdan answered Oct 06 '22 00:10