I'm currently building a Scrapy project that can crawl any website from the first depth to the last. I don't extract much data, but I store the whole page HTML (response.body) in a database.
I am currently using Elasticsearch with the bulk API to store my raw HTML.
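For reference, the storage side looks roughly like the pipeline below. This is a minimal sketch, not my exact code; the class name, index name, item fields (`url`, `html`) and buffer size are illustrative assumptions.

```python
# Sketch of a Scrapy pipeline that buffers pages and flushes them to
# Elasticsearch with the bulk helper. Names and buffer size are illustrative.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


class RawHtmlPipeline:
    def __init__(self, buffer_size=500):
        self.es = Elasticsearch(["http://localhost:9200"])
        self.buffer_size = buffer_size
        self.actions = []

    def process_item(self, item, spider):
        # Queue one bulk action per crawled page.
        self.actions.append({
            "_index": "raw_pages",
            "_source": {"url": item["url"], "html": item["html"]},
        })
        if len(self.actions) >= self.buffer_size:
            bulk(self.es, self.actions)
            self.actions = []
        return item

    def close_spider(self, spider):
        # Flush whatever is left when the crawl ends.
        if self.actions:
            bulk(self.es, self.actions)
```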
I had a look at Cassandra, but I did not find an equivalent of the Elasticsearch bulk API, and that hurts the performance of my spider.
I am interested in performance and was wondering whether Elasticsearch is a good choice, or whether there is a more appropriate NoSQL database.
That very much depends on what you are planning to do with the scraped data later on.
Elasticsearch does some complex indexing work upon insertion, which makes subsequent searches in the database quite fast ... but this also costs processing time and introduces latency.
So to answer your question whether Elasticsearch is a good choice:
If you plan to build some kind of search engine later on, Elasticsearch is a good choice (as the name indicates). But you should have a close look at the configuration of Elasticsearch's index settings and mappings to make sure indexing works the way you need it.
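For example, if the raw HTML only needs to be retrievable for now (not full-text searched yet), you can dial down the indexing overhead when creating the index. A hedged sketch, assuming an index called `raw_pages` and the field names from above:

```python
# Sketch of reducing indexing cost at index-creation time.
# The index name and field names are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(
    index="raw_pages",
    settings={
        "number_of_replicas": 0,    # skip replicas during the crawl
        "refresh_interval": "30s",  # refresh less often than the 1s default
    },
    mappings={
        "properties": {
            "url": {"type": "keyword"},
            # Keep the HTML in _source but skip building an inverted index on it.
            "html": {"type": "text", "index": False},
        }
    },
)
```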
If, on the other hand, you just want to store the data and run processing tasks on it later, Elasticsearch is a poor choice and you would be better off with Cassandra or another NoSQL database.
Which NoSQL database suits your needs best depends, again, on the actual usage scenario.
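If you do go the Cassandra route: there is no single bulk endpoint like in Elasticsearch, but the DataStax Python driver gets comparable write throughput from concurrent asynchronous inserts. A minimal sketch, assuming a keyspace and table you would create yourself:

```python
# Sketch of high-throughput inserts into Cassandra without a bulk API.
# Keyspace, table, and column names are assumptions for illustration.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("crawler")  # assumed keyspace

insert = session.prepare("INSERT INTO raw_pages (url, html) VALUES (?, ?)")

def store_pages(pages):
    # pages: iterable of (url, html) tuples buffered by the spider.
    # The driver keeps up to `concurrency` requests in flight at once,
    # which plays a similar role, performance-wise, to Elasticsearch's bulk API.
    execute_concurrent_with_args(session, insert, pages, concurrency=50)
```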