I'm currently trying to scrape a website that has fairly poorly-formatted HTML (often missing closing tags, no use of classes or ids so it's incredibly difficult to go straight to the element you want, etc.). I've been using BeautifulSoup with some success so far but every once and a while (though quite rarely), I run into a page where BeautifulSoup creates the HTML tree a bit differently from (for example) Firefox or Webkit. While this is understandable as the formatting of the HTML leaves this ambiguous, if I were able to get the same parse tree as Firefox or Webkit produces I would be able to parse things much more easily.
The problems are usually something like the site opens a <b>
tag twice and when BeautifulSoup sees the second <b>
tag, it immediately closes the first while Firefox and Webkit nest the <b>
tags.
Is there a web scraping library for Python (or even any other language (I'm getting desperate)) that can reproduce the parse tree generated by Firefox or WebKit (or at least get closer than BeautifulSoup in cases of ambiguity).
Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don't hold the right to share is illegal.
Web scraping is completely legal if you scrape data publicly available on the internet. But some kinds of data are protected by international regulations, so be careful scraping personal data, intellectual property, or confidential data.
So is it legal or illegal? Web scraping and crawling aren't illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. Startups love it because it's a cheap and powerful way to gather data without the need for partnerships.
Large Collection of Libraries: Python has a huge collection of libraries such as Numpy, Matlplotlib, Pandas etc., which provides methods and services for various purposes. Hence, it is suitable for web scraping and for further manipulation of extracted data.
Use BeautifulSoup
as a tree builder for html5lib
:
from html5lib import HTMLParser, treebuilders
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
text = "a<b>b<b>c"
soup = parser.parse(text)
print soup.prettify()
Output:
<html>
<head>
</head>
<body>
a
<b>
b
<b>
c
</b>
</b>
</body>
</html>
pyWebKitGTK looks like it might be of some help.
Also here is a dude that had to do the same thing but get the export of the content after javascript ran, execute javascript from python using pyWebKitGTK.
pyWebkitGTK at the cheeseshop.
You can also do this with pyQt.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With