I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a regular interval and parses each thread with new content. The author, the date, and the content of new posts are extracted via regular expressions, and the result is then stored in a database.
The language and platform used for the crawler have to match the following criteria:
After some research, I think Erlang might be a fitting candidate, but I have read that it is not very good at string processing (and thus at regular expression matching). Nor do I have any experience regarding the maintenance factor.
Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?
I am also evaluating Erlang for use as a web crawler, and it looks good so far.
There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.
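For instance, a minimal fetch-and-match sketch using only the OTP built-ins inets (httpc) and re could look like the following; the module name, the URL, and the author regex are placeholders, since the real pattern depends on your forum's markup:

    %% A minimal sketch, assuming authors appear in markup like
    %% <span class="author">name</span>. Uses only OTP built-ins:
    %% inets (httpc) for HTTP and re (PCRE) for matching.
    -module(crawl_sketch).
    -export([fetch_authors/1]).

    fetch_authors(Url) ->    %% Url is a string, e.g. "http://forum.example.com/thread/1"
        inets:start(),       %% start the HTTP client application (idempotent)
        {ok, {{_, 200, _}, _Headers, Body}} =
            httpc:request(get, {Url, []}, [], [{body_format, binary}]),
        case re:run(Body, <<"<span class=\"author\">(.*?)</span>">>,
                    [global, {capture, all_but_first, binary}]) of
            {match, Authors} -> [A || [A] <- Authors];
            nomatch          -> []
        end.

For real pages you would probably swap the raw regex for one of the HTML or XPath parsers mentioned above, since regexes are brittle against markup changes.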
And other people are interested in the same use case, so you can learn from them.
However, if this is just a one-off project, I recommend Python / Ruby / Perl, because it will be easier to get started with.
If you're familiar and comfortable with Erlang, then I'd stick with it if I were you, although I'm not familiar with Erlang myself. With that noted, I'll give you some pointers:
A web crawler is a fairly complex system to build, and you have to be concerned about speed, performance, scalability, and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in Erlang.
Erlang is fine for this. Its regex library delegates (nearly all) work to PCRE, which should be fast enough. But avoid strings and use binaries instead! They both use a lot less memory and are faster to translate to C strings.
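To illustrate that binary-oriented style, here is a small sketch; the post markup and field layout are invented purely for the example:

    %% Subject, pattern, and captures are all binaries, so no char-list
    %% strings are built anywhere along the way.
    Body = <<"<div class=\"post\" author=\"alice\" date=\"2010-05-01\">Hi!</div>">>,
    {ok, Pattern} = re:compile(<<"author=\"([^\"]+)\" date=\"([^\"]+)\">(.*?)</div>">>),
    {match, [Author, Date, Content]} =
        re:run(Body, Pattern, [{capture, all_but_first, binary}])
    %% Author =:= <<"alice">>, Date =:= <<"2010-05-01">>, Content =:= <<"Hi!">>

Compiling the pattern once with re:compile/1 and reusing it across posts also avoids recompiling the regex on every match.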