
Web Scraping With Haskell

What is the current state of libraries for scraping websites with Haskell?

I'm trying to make myself do more of my quick one-off tasks in Haskell, in order to help increase my comfort level with the language.

In Python, I tend to use the excellent PyQuery library for this. Is there something similarly simple and easy in Haskell? I've looked into TagSoup, and while the parser itself seems nice, actually traversing pages doesn't seem as nice as it is in other languages.

Is there a better option out there?

asked Jan 29 '11 by ricree


2 Answers

http://hackage.haskell.org/package/shpider

Shpider is a web automation library for Haskell. It allows you to quickly write crawlers, and for simple cases (like following links) even without reading the page source.

It has useful features such as turning relative links from a page into absolute links, options to authorize transactions only on a given domain, and the option to only download HTML documents.

It also provides a nice syntax for filling out forms.

An example:

    runShpider $ do
        download "http://apage.com"
        theForm : _ <- getFormsByAction "http://anotherpage.com"
        sendForm $ fillOutForm theForm $ pairs $ do
            "occupation" =: "unemployed Haskell programmer"
            "location" =: "mother's house"

(Edit in 2018 -- shpider is deprecated, these days https://hackage.haskell.org/package/scalpel might be a good replacement)
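To give a feel for scalpel's style, here is a minimal sketch (not from the original answer; the sample HTML and the expectation that `scrapeStringLike` and `attrs` behave as documented are mine). It runs a scraper over an in-memory string rather than a live page:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

main :: IO ()
main = do
  -- A made-up HTML fragment standing in for a downloaded page.
  let html = "<ul><li><a href=\"/a\">First</a></li><li><a href=\"/b\">Second</a></li></ul>" :: String
  -- attrs collects the href attribute of every <a> element matched by the selector.
  print (scrapeStringLike html (attrs "href" "a"))
```

For real pages you would use `scrapeURL` instead of `scrapeStringLike`, but the scraper itself stays the same, which makes the selectors easy to test offline.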

answered Oct 02 '22 by sclv


From my searching on the Haskell mailing lists, it appears that TagSoup is the dominant choice for parsing pages. For example: http://www.haskell.org/pipermail/haskell-cafe/2008-August/045721.html
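As a rough illustration of the TagSoup style (my own sketch, not from the mailing list post; the `linkHrefs` name and the sample fragment are invented), extracting link targets is a matter of pattern matching on the tag stream that `parseTags` produces:

```haskell
import Text.HTML.TagSoup

-- Pull every link target out of a fragment of HTML by matching
-- opening <a> tags and reading their href attributes.
linkHrefs :: String -> [String]
linkHrefs html =
  [ v | TagOpen "a" as <- parseTags html, ("href", v) <- as ]

main :: IO ()
main = print (linkHrefs "<p>Hello <a href=\"/x\">link</a> world</p>")
```

This flat list-of-tags model is what the question means by traversal being less pleasant than PyQuery: there is no tree to walk, only a stream to filter.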

As far as the other aspects of web scraping (such as crawling, spidering, and caching), I searched http://hackage.haskell.org/package/ for those keywords but didn't find anything promising. I even skimmed through packages mentioning "http", but nothing jumped out at me.

Note: I'm not a regular Haskeller, so I hope others can chime in if I missed something.

answered Oct 03 '22 by David J.