Best Python Module for HTML parsing [closed]

I have a website updater (people can update the text content, not the look of the website) that uses HTML and JavaScript on the front end and Python on the back end/server side.

I am finding it very difficult to update the HTML from the front end, because grabbing the updated HTML with ele.innerHTML or $(ele).html() gives altered HTML depending on the browser (DAMN IE).

So I have decided to update my HTML from the back end, i.e., in Python.

What do you think is the best Python module to parse HTML & grab information?

My requirements are:
- that the module work with Python 2.5 or earlier (because of my webhost)
- I will be parsing HTML & finding all the HTML elements that have the class "updatable"
- For each element of the class "updatable": extract the innerText (text/content only, not HTML)

Which Python module would you suggest is best for this?
- HTMLParser.py
- htmllib.py
- do you know of any other Python 2.5 compatible modules?
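
For illustration, the requirements above can be sketched with the standard-library HTMLParser alone (the first candidate in the list). This is a rough sketch, not a recommendation: the names UpdatableTextParser, updatable_texts and VOID_TAGS are made up for this example, it assumes reasonably well-formed markup (every non-void tag is closed), a nested "updatable" element is folded into the text of its outer element, and on Python 2 entity references such as &amp; are simply skipped.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.x

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'param', 'source', 'track', 'wbr'])


class UpdatableTextParser(HTMLParser):
    """Collects the text of every element whose class list contains 'updatable'."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.results = []   # one text string per updatable element
        self._depth = 0     # nesting depth inside the current updatable element
        self._chunks = []   # text pieces seen inside the current element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            # Already inside an updatable element: just track nesting.
            self._depth += 1
        else:
            classes = (dict(attrs).get('class') or '').split()
            if 'updatable' in classes:
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # The updatable element has just been closed.
            self.results.append(''.join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def updatable_texts(html):
    """Return a list with the text content of each 'updatable' element."""
    parser = UpdatableTextParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling updatable_texts(page_source) on the HTML string returns one text string per matching element, in document order. Depth counting like this breaks on badly nested markup, which is where a more forgiving parser becomes attractive.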

sazr asked Oct 04 '11 23:10


2 Answers

For parsing HTML I would suggest you take a look at Beautiful Soup. It's pretty powerful and can deal with some messed up markup as well.

http://www.crummy.com/software/BeautifulSoup/

Check this out and see if it helps you out! Hope it does.

pcalcao answered Sep 19 '22 01:09


I've been using lxml ( http://lxml.de/lxmlhtml.html ). It is relatively fast for normal-sized HTML documents and has support for using BeautifulSoup. As I understand it, BeautifulSoup is no longer supported, so for all new projects I've used lxml.

David answered Sep 18 '22 01:09