
Writing a program to scrape forums

I need to write a program to scrape forums.

Should I write it in Python using the Scrapy framework, or should I use PHP with cURL? Also, is there a PHP equivalent to Scrapy?

Thanks

seanieb asked Jun 29 '26 16:06



1 Answer

I would choose Python for its superior libxml2 bindings, specifically lxml.html and pyquery. Scrapy has its own libxml2 bindings; I haven't tested them, and skimming the Scrapy documentation didn't leave me very impressed (I've done lots of scraping using just these parsers and manual coding). With any of these you get a truly superior HTML parser and querying via XPath, and with lxml.html and pyquery (which is also built on lxml) you also get CSS selectors.
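As a rough illustration of the lxml.html and XPath approach described above, here is a minimal sketch. The sample markup, the class names (post, author, body) and the function name extract_posts are all invented for the example; a real forum will use different markup. CSS selectors via lxml's cssselect() need the separate cssselect package, so the sketch sticks to XPath.

```python
import lxml.html

# Hypothetical forum markup, invented for illustration only.
SAMPLE_HTML = """
<html><body>
  <div class="post"><span class="author">alice</span><p class="body">First post</p></div>
  <div class="post"><span class="author">bob</span><p class="body">A reply</p></div>
</body></html>
"""


def extract_posts(html):
    """Return a list of (author, text) tuples from one page of forum HTML."""
    doc = lxml.html.fromstring(html)
    posts = []
    # XPath: every <div> whose class attribute is exactly "post".
    for node in doc.xpath('//div[@class="post"]'):
        # string(...) returns the text content of the first matching node.
        author = node.xpath('string(.//span[@class="author"])').strip()
        body = node.xpath('string(.//p[@class="body"])').strip()
        posts.append((author, body))
    return posts


posts = extract_posts(SAMPLE_HTML)
```

Here posts holds one (author, text) pair per post on the page; adapting it to a real forum is mostly a matter of changing the two or three XPath expressions.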

If you are doing a small job scraping a forum, I'd skip a scraping framework and just do it by hand. It's easy, and parallelizing and the like are not really needed.
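Doing it by hand could look something like the sketch below: a plain sequential loop that downloads a page, pulls out the posts, and follows the "next page" link until there is none. The function names, the p.body post selector, and the rel="next" pagination link are assumptions made up for illustration, not something every forum provides.

```python
import urllib.request
from urllib.parse import urljoin

import lxml.html


def fetch(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def scrape_thread(start_url, fetch=fetch, max_pages=50):
    """Collect post texts from a thread, one page at a time, with no parallelism."""
    texts, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        doc = lxml.html.fromstring(fetch(url))
        # Assumed markup: each post body is a <p class="body"> element.
        texts += [p.text_content().strip() for p in doc.xpath('//p[@class="body"]')]
        # Assumed markup: the pagination link is <a rel="next" href="...">.
        next_links = doc.xpath('//a[@rel="next"]/@href')
        url = urljoin(url, next_links[0]) if next_links else None
    return texts
```

Passing fetch as a parameter keeps the loop easy to try against saved pages before pointing it at a live site. The seen set and max_pages cap are there only to stop the loop if the pagination links ever cycle.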

Ian Bicking answered Jul 02 '26 06:07
