Our company is working on a project that requires a database with 30-50 million rows of product data. These rows contain text that needs to be searched concurrently thousands of times per second. Moreover, each search needs to take less than one second to execute.
So, all in all, we have a 50M-row database that needs to be searched thousands of times per second. Keep in mind that these are full-text searches. I know that MySQL, or any relational database on its own, cannot handle this type of job. So we're looking for someone who can design the right setup for us and help us implement it, for a price you specify.
First off, we'd like to know what our best options are. I've personally been researching things such as Sphinx, Lucene, Cassandra, MongoDB, CouchDB, Solr, etc., but I really don't know which should be used in conjunction with which to give us the most efficient setup possible.
So, if anyone could just give some advice, or take up our job offer, it would be greatly appreciated.
You can contact me via PM here, and I'll give you my email/IM/phone number to further discuss.
Thanks!
Storing data and searching it are two different things. If you look at architectures like eBay's, they run separate services and servers for search. 50M rows is nothing; you can store that in any of these datastores. None of them is perfect, so the difference comes down to use case. For example:
Cassandra has the fastest insert performance at any data size, scales easily to petabytes across hundreds of machines (no manual sharding needed), is highly durable, and has Lucandra (a Cassandra-Lucene integration that scales well with massive data but is a toy compared to Elasticsearch).
MongoDB has more query options (it uses B-tree indexes, as a DBMS does), recently gained autosharding, and can index any field, but has poor durability.
PostgreSQL is the most advanced open source DBMS out there, recently gained built-in master/slave replication, can scale by sharding, and is ACID and SQL compliant.
CouchDB, I think, has no advantage over the others in any use case; it's damn slow. If I needed ACID, I would probably use PostgreSQL.
The built-in full-text search in these datastores has some problems and is not scalable.
The most advanced open source search engine (massive data, high performance, simple, distributed, fault tolerant, REST API) is Elasticsearch; you can think of it as distributed Lucene. Solr is legacy compared to Elasticsearch. Using raw Lucene or Sphinx is not scalable.
If I were you, I would probably choose one of the datastores, use Elasticsearch for indexing, and synchronize the two in my data access layer (the index has to be updated on every DB insert/update/delete).
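To make the "synchronize them in the data access layer" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `InMemoryStore` stands in for whichever datastore you pick, `InMemoryIndex` is a toy inverted index standing in for the search engine, and `ProductRepository` is a hypothetical name for the data access layer. It only shows the shape of the pattern (every write goes to both the store and the index), not a real Elasticsearch or database client.

```python
import re
from collections import defaultdict


class InMemoryStore:
    """Stand-in for the primary datastore (assumption, not a real client)."""

    def __init__(self):
        self.rows = {}

    def put(self, product_id, text):
        self.rows[product_id] = text

    def delete(self, product_id):
        self.rows.pop(product_id, None)


class InMemoryIndex:
    """Stand-in for the search engine: a toy inverted index (term -> ids)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_terms = {}

    @staticmethod
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def index(self, product_id, text):
        self.remove(product_id)  # drop stale terms when a row is updated
        terms = self.tokenize(text)
        self.doc_terms[product_id] = terms
        for term in terms:
            self.postings[term].add(product_id)

    def remove(self, product_id):
        for term in self.doc_terms.pop(product_id, set()):
            self.postings[term].discard(product_id)

    def search(self, query):
        # AND semantics: a document must contain every query term.
        terms = self.tokenize(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


class ProductRepository:
    """Hypothetical data access layer that keeps store and index in step."""

    def __init__(self, store, index):
        self.store = store
        self.index = index

    def save(self, product_id, text):
        # Insert or update: write the row, then refresh its index entry.
        self.store.put(product_id, text)
        self.index.index(product_id, text)

    def delete(self, product_id):
        self.store.delete(product_id)
        self.index.remove(product_id)

    def search(self, query):
        # Search hits the index only; rows are then fetched by id.
        ids = self.index.search(query)
        return {pid: self.store.rows[pid] for pid in ids}


repo = ProductRepository(InMemoryStore(), InMemoryIndex())
repo.save(1, "Red cotton shirt")
repo.save(2, "Blue cotton jeans")
matches = repo.search("cotton")
```

In a real setup the two writes are not atomic, so you would also need to decide what happens when the index update fails after the row has been stored (retry, queue the change, or periodically reindex). The sketch leaves that out.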
Regards