Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Web scraping with Python [closed]

I'm currently trying to scrape a website that has fairly poorly-formatted HTML (often missing closing tags, no use of classes or ids so it's incredibly difficult to go straight to the element you want, etc.). I've been using BeautifulSoup with some success so far but every once and a while (though quite rarely), I run into a page where BeautifulSoup creates the HTML tree a bit differently from (for example) Firefox or Webkit. While this is understandable as the formatting of the HTML leaves this ambiguous, if I were able to get the same parse tree as Firefox or Webkit produces I would be able to parse things much more easily. The problems are usually something like the site opens a <b> tag twice and when BeautifulSoup sees the second <b> tag, it immediately closes the first while Firefox and Webkit nest the <b> tags.

Is there a web scraping library for Python (or even any other language (I'm getting desperate)) that can reproduce the parse tree generated by Firefox or WebKit (or at least get closer than BeautifulSoup in cases of ambiguity).

like image 847
Jack Edmonds Avatar asked Mar 07 '10 18:03

Jack Edmonds


People also ask

Is web scraping with Python legal?

Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don't hold the right to share is illegal.

Why web scraping is not allowed?

Web scraping is completely legal if you scrape data publicly available on the internet. But some kinds of data are protected by international regulations, so be careful scraping personal data, intellectual property, or confidential data.

Can you get in trouble for web scraping?

So is it legal or illegal? Web scraping and crawling aren't illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. Startups love it because it's a cheap and powerful way to gather data without the need for partnerships.

Is Python good for data scraping?

Large Collection of Libraries: Python has a huge collection of libraries such as Numpy, Matlplotlib, Pandas etc., which provides methods and services for various purposes. Hence, it is suitable for web scraping and for further manipulation of extracted data.


2 Answers

Use BeautifulSoup as a tree builder for html5lib:

from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

text = "a<b>b<b>c"
soup = parser.parse(text)
print soup.prettify()

Output:

<html>
 <head>
 </head>
 <body>
  a
  <b>
   b
   <b>
    c
   </b>
  </b>
 </body>
</html>
like image 83
jfs Avatar answered Oct 23 '22 16:10

jfs


pyWebKitGTK looks like it might be of some help.

Also here is a dude that had to do the same thing but get the export of the content after javascript ran, execute javascript from python using pyWebKitGTK.

pyWebkitGTK at the cheeseshop.

You can also do this with pyQt.

like image 35
Ryan Christensen Avatar answered Oct 23 '22 18:10

Ryan Christensen