I need to write a program to scrape forums.
Should I write the program in Python using the Scrapy framework or should I use Php cURL? Also is there a Php equivalent to Scrapy?
Thanks
I would choose Python due to superior libxml2 bindings, specifically things like lxml.html and pyQuery. Scrapy has its own libxml2 bindings, I haven't looked at them to test them, though skimming the Scrapy documentation didn't leave me very impressed (I've done lots of scraping just using these parsers and manual coding). With any of these you get a truly superior HTML parser, querying via XPath, and with lxml.html and pyquery (also built on lxml) you get CSS selectors.
If you are doing a small job scraping a forum, I'd skip a scraping framework and just do it by hand -- it's easy and parallelizing etc is not really needed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With