Web scraping with Python [closed]

I'm currently trying to scrape a website with fairly poorly formatted HTML: closing tags are often missing, and there are no classes or ids, so it's incredibly difficult to go straight to the element you want. I've been using BeautifulSoup with some success so far, but every once in a while (though quite rarely) I run into a page where BeautifulSoup builds the HTML tree a bit differently from, for example, Firefox or WebKit. This is understandable, since the malformed HTML leaves the structure ambiguous, but if I could get the same parse tree that Firefox or WebKit produces, I would be able to parse things much more easily. The usual problem is something like this: the site opens a <b> tag twice, and when BeautifulSoup sees the second <b> tag it immediately closes the first, while Firefox and WebKit nest the <b> tags.

Is there a web scraping library for Python (or any other language; I'm getting desperate) that can reproduce the parse tree generated by Firefox or WebKit, or at least get closer to it than BeautifulSoup does in ambiguous cases?

asked Mar 07 '10 18:03

Jack Edmonds

2 Answers

Use BeautifulSoup as a tree builder for html5lib:

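from html5lib import HTMLParser, treebuilders

# Let html5lib do the parsing and hand back a BeautifulSoup tree
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

text = "a<b>b<b>c"  # the second <b> opens before the first is closed
soup = parser.parse(text)
print(soup.prettify())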

Output:

<html>
 <head>
 </head>
 <body>
  a
  <b>
   b
   <b>
    c
   </b>
  </b>
 </body>
</html>
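
With Beautiful Soup 4 the same idea can be written by passing "html5lib" as the parser name. This is a minimal sketch, assuming the bs4 and html5lib packages are both installed. As far as I know, recent html5lib releases no longer include the "beautifulsoup" tree builder used above, so this form is more likely to run today. The helper name parse_like_browser is made up for illustration:

```python
from bs4 import BeautifulSoup


def parse_like_browser(markup):
    """Parse markup with html5lib as the backend (hypothetical helper).

    Assumes the html5lib package is installed; Beautiful Soup raises
    an error if the requested parser is not available.
    """
    return BeautifulSoup(markup, "html5lib")


soup = parse_like_browser("a<b>b<b>c")

# html5lib follows the HTML5 parsing rules, so the second <b>
# ends up nested inside the first instead of closing it.
outer_b = soup.body.b
inner_b = outer_b.b

print(inner_b.get_text())  # c
print(outer_b.get_text())  # bc
print(soup.prettify())
```

The result is an ordinary BeautifulSoup object, so existing code that calls find() or find_all() on the tree should not need to change.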

answered Oct 23 '22 16:10

jfs

pyWebKitGTK looks like it might be of some help.

There is also a write-up by someone who had to do the same thing, except that he needed the content as it stood after JavaScript had run: executing JavaScript from Python using pyWebKitGTK.

pyWebKitGTK is on the Cheese Shop (PyPI).

You can also do this with PyQt.

answered Oct 23 '22 18:10

Ryan Christensen

Tags:

python

firefox

webkit

web-scraping
