
Options for HTML scraping? [closed]


The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot.


In the .NET world, I recommend the HTML Agility Pack. It's not nearly as simple as some of the other options (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.

http://www.codeplex.com/htmlagilitypack


BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options and is a lot more Pythonic. If you want to try Ruby, there is a port of BeautifulSoup called RubyfulSoup, but it hasn't been updated in a while.

Other useful tools are HTMLParser or sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time the parser enters or exits a tag or encounters HTML text. They're like Expat if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where creating a DOM tree would be slow and expensive.
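
As a rough sketch of the callback style described above, here is a small link collector built on the standard library's HTMLParser (in Python 3 it lives in html.parser; sgmllib was removed in Python 3). The class name LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs without building a DOM tree."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Called every time the parser enters a tag.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Called for text between tags.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Called every time the parser leaves a tag.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = LinkCollector()
parser.feed('<p>See <a href="/docs">the <b>docs</b></a> and <a href="/faq">FAQ</a>.</p>')
parser.close()
print(parser.links)
```

For a very large file, feed() can be called repeatedly with one chunk at a time, so the whole document never has to sit in memory.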

Regular expressions are rarely necessary. BeautifulSoup accepts regular expressions in its searches, so if you need their power you can use them there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.


I found HTMLSQL to be a ridiculously simple way to screenscrape. It takes literally minutes to get results with it.

The queries are super-intuitive - like:

SELECT title from img WHERE $class == 'userpic'

There are now some other alternatives that take the same approach.
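
For comparison, roughly the same selection as the query above can be written in Python with the standard library's xml.etree.ElementTree, whose limited XPath support handles simple attribute filters. This is only a sketch, not how HTMLSQL works internally: ElementTree needs well-formed markup (it will not cope with tag soup), and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; ElementTree rejects tag soup.
page = """<html><body>
  <img class="userpic" title="alice" src="a.png" />
  <img class="banner" title="ad" src="b.png" />
  <img class="userpic" title="bob" src="c.png" />
</body></html>"""

root = ET.fromstring(page)

# Rough equivalent of: SELECT title from img WHERE $class == 'userpic'
titles = [img.get("title") for img in root.findall(".//img[@class='userpic']")]
print(titles)
```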


The Python lxml library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and its pretty-printing of the in-memory XML structure. It also supports parsing broken HTML. And I don't think you can find another Python library or binding that parses XML faster than lxml.
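
A minimal sketch of those three features together (broken HTML, XPath, pretty-printing), assuming lxml is installed; the sample markup is invented for illustration.

```python
from lxml import etree, html

# Deliberately broken markup: unclosed <p> and <a> tags.
broken = "<div><p>First<p>Second <a href='/next'>next</div>"

doc = html.fromstring(broken)      # the parser repairs the tree as it reads

hrefs = doc.xpath("//a/@href")     # XPath query over the repaired tree
paragraphs = doc.xpath("//p")

print(hrefs)
print(len(paragraphs))

# Pretty-print the repaired in-memory structure.
print(etree.tostring(doc, pretty_print=True, encoding="unicode"))
```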


For Perl, there's WWW::Mechanize.


Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:

  • mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages
  • lxml: Python binding to libxml2 and libxslt. Supports various options to traverse and select elements (e.g. XPath and CSS selection)
  • scrapemark: high-level library using templates to extract information from HTML.
  • pyquery: allows you to make jQuery-like queries on XML documents.
  • scrapy: a high-level scraping and web crawling framework. It can be used to write spiders for data mining, monitoring and automated testing