
Using Gecko/Firefox or WebKit for HTML parsing in Python

I am using BeautifulSoup and urllib2 to download HTML pages and parse them. The problem is malformed HTML pages. Although BeautifulSoup is good at handling malformed HTML, it is still not as good as Firefox.
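To illustrate the kind of input at issue, here is a minimal sketch. It assumes Python 3 and uses the standard-library `html.parser` in place of BeautifulSoup so that it runs without third-party packages; the `TagLogger` class, the `log_events` helper, and the sample markup are made up for the example. It only shows that a tolerant, event-based parser reports tags as they appear in the source and does not repair the tree the way a browser would.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records parser events in source order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text nodes to keep the log short.
        if data.strip():
            self.events.append(("data", data.strip()))


def log_events(markup):
    """Feed markup to the parser and return the recorded events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Malformed sample: the first <p> and the <b> are never closed.
    for event in log_events("<p>one<p>two<b>bold</p>"):
        print(event)
```

The log contains two `p` start tags but only one `p` end tag, and no end tag for `b`. Deciding where the missing closes belong is exactly the normalization step I would like a browser engine to handle.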

Since Firefox and WebKit are more up to date and more resilient at handling HTML, I think it would be ideal to use one of them to construct and normalize the DOM tree of a page, and then manipulate that tree from Python.

However, I can't find any Python bindings for either. Can anyone suggest a way?

I came across some solutions that run a headless Firefox process and drive it from Python, but is there a more Pythonic solution available?

user90147 asked Apr 22 '09 22:04


1 Answer

Perhaps pywebkitgtk, the Python bindings for the GTK port of WebKit, would do what you need.

vezult answered Oct 04 '22 21:10