I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a regular interval and parses each thread with new content. The author, the date, and the content of new posts are extracted via regular expressions, and the result is then stored in a database.
The language and platform used for the crawler have to match the following criteria:
After some research, I think Erlang might be a fitting candidate, but I have read that it is not very good at string processing (and thus at regular expression matching). Nor do I have any experience regarding the maintenance factor.
Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?
I am also evaluating Erlang for use as a web crawler, and it looks good so far.
There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.
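For instance, a minimal fetch-and-match sketch using only the OTP built-ins inets (httpc) and re could look like the following; the module name, the URL, and the author regex are placeholders, since the real pattern depends on your forum's markup:

    %% A minimal sketch, assuming authors appear in markup like
    %% <span class="author">name</span>. Uses only OTP built-ins:
    %% inets (httpc) for HTTP and re (PCRE) for matching.
    -module(crawl_sketch).
    -export([fetch_authors/1]).

    fetch_authors(Url) ->    %% Url is a string, e.g. "http://forum.example.com/thread/1"
        inets:start(),       %% start the HTTP client application (idempotent)
        {ok, {{_, 200, _}, _Headers, Body}} =
            httpc:request(get, {Url, []}, [], [{body_format, binary}]),
        case re:run(Body, <<"<span class=\"author\">(.*?)</span>">>,
                    [global, {capture, all_but_first, binary}]) of
            {match, Authors} -> [A || [A] <- Authors];
            nomatch          -> []
        end.

For real pages you would probably swap the raw regex for one of the HTML or XPath parsers mentioned above, since regexes are brittle against markup changes.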
And other people are interested in the same use case, so you can learn from them.
However, if this is just a one-off project, I recommend Python / Ruby / Perl, because it will be easier to get started with.
If you're familiar and comfortable with Erlang, then I'd stick with it if I were you, although I'm not familiar with Erlang myself. With that noted, I'll give you some pointers:
A web crawler is a fairly complex system to build, and you have to be concerned about speed, performance, scalability, and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in Erlang.
Erlang is fine for this. Its regex library delegates (nearly all) work to PCRE, which should be fast enough. But avoid strings and use binaries instead! They both use a lot less memory and are faster to translate to C strings.
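To illustrate that binary-oriented style, here is a small sketch; the post markup and field layout are invented purely for the example:

    %% Subject, pattern, and captures are all binaries, so no char-list
    %% strings are built anywhere along the way.
    Body = <<"<div class=\"post\" author=\"alice\" date=\"2010-05-01\">Hi!</div>">>,
    {ok, Pattern} = re:compile(<<"author=\"([^\"]+)\" date=\"([^\"]+)\">(.*?)</div>">>),
    {match, [Author, Date, Content]} =
        re:run(Body, Pattern, [{capture, all_but_first, binary}])
    %% Author =:= <<"alice">>, Date =:= <<"2010-05-01">>, Content =:= <<"Hi!">>

Compiling the pattern once with re:compile/1 and reusing it across posts also avoids recompiling the regex on every match.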