Some colleagues and I have started working on an iPhone application that provides a social buying experience. The goal is to give the user extended search capabilities (full text, fuzzy search, filter-based, etc.) over millions of products that are constantly fetched from several product-listing APIs (such as eBay and Amazon), then normalized (i.e. fields, categories and relations are transformed) and run through some business logic, so that users get customized content based on several criteria (a unique profile, i.e. age/gender, search history, what their friends bought, etc.). The application also has social features such as posts, likes and reviews about the products, following other users, and so on.
So now we are trying to design the server architecture that will support these needs. Among other things there are performance considerations ("give me all the products that match my search word and order them by relevancy" should run pretty fast, roughly 1 to 10 seconds) and scalability considerations (10 concurrent users should get a result in the same amount of time as 100,000 users, provided I can throw more machines at the problem).
We assume we will have roughly tens to hundreds of millions of products.
What we had in mind is (based on AWS):
Our main considerations are:
Now several questions:
By the way, war stories would be much appreciated :)
I think for what you've described, you will probably want to avoid Elastic Beanstalk and deploy directly onto EC2 instances that you control.
The front end will serve the web load and mostly query from cache. It can sit behind an Elastic Load Balancer, and you can use auto-scaling rules to ensure you always have enough resources to handle the load.
I would probably look at Solr for full-text search, though I'm not an expert in this. I think Solr has some of the scalability, replication, etc. features that make your search infrastructure easier to manage, and there are some good AWS Solr reference architectures that are designed to scale.
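To make the "match my search word and order by relevancy" requirement concrete, here is a minimal sketch of building a Solr select request. It assumes a core named `products` with `title` and `description` fields indexed for full-text search; the core and field names are illustrative, not prescribed by anything above.

```python
from urllib.parse import urlencode

def build_solr_query(search_term, rows=20):
    """Build the query string for a Solr /select request.

    Assumes a core named 'products' with 'title' and 'description'
    text fields (names are illustrative).
    """
    params = {
        "q": f"title:({search_term}) OR description:({search_term})",
        "defType": "edismax",   # relevance-oriented query parser
        "sort": "score desc",   # order results by relevancy
        "rows": rows,
        "wt": "json",
    }
    return "/solr/products/select?" + urlencode(params)

query = build_solr_query("mountain bike")
print(query)
```

In a real deployment this path would be sent to your Solr host (or a SolrCloud load balancer); the point is that relevancy ordering and paging are expressed in the query itself, so the front end stays thin.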
It sounds like you will need a couple of back-end service layers: one to pull in the data, another to normalize it. If you are going to commit to AWS, you can probably build these so that a central control process doles out work to instances you get via the spot market, which can help reduce overall costs. If the spot market spikes, you can choose to either slow down importing/processing, or use on-demand instances and accept somewhat higher costs.
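The control-process-plus-workers idea can be sketched with an in-process queue; this is a stand-in, assuming that in production the control process would push jobs to something like SQS and the spot instances would consume them. The `normalize` function and the provider field names (`itemId`, `name`) are hypothetical.

```python
import queue
import threading

def normalize(raw):
    # Illustrative normalization: map provider-specific fields
    # onto one internal schema.
    return {"id": raw["itemId"], "title": raw["name"].strip().title()}

def worker(jobs, results):
    while True:
        raw = jobs.get()
        if raw is None:          # poison pill: shut this worker down
            break
        results.put(normalize(raw))
        jobs.task_done()

jobs, results = queue.Queue(), queue.Queue()
# Threads stand in for spot instances; the main thread is the
# central control process doling out work.
workers = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(4)]
for w in workers:
    w.start()

for raw in [{"itemId": "ebay-1", "name": " red shoes "},
            {"itemId": "amzn-2", "name": "blue hat"}]:
    jobs.put(raw)
jobs.join()                      # wait until all jobs are processed

for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()

out = []
while not results.empty():
    out.append(results.get())
print(len(out))  # 2
```

The same shape carries over to SQS: the queue gives you backpressure for free, so slowing down imports when spot prices spike is just a matter of starting fewer consumers.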
I'd probably design this to use a combination of MySQL and a NoSQL store: MySQL for core functionality (accounts, user preferences, etc.), NoSQL for the product information. You probably want to store the products in a format that can be used directly by the UI with minimal processing. Properly designed, this should allow sharding of the NoSQL store, which will help scalability, although you'll need a way to reproduce data if a node goes down.
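One simple way to shard the product store is to hash the product key to a shard number; a minimal sketch, assuming hash-modulo sharding and an illustrative UI-ready document shape (the field names are made up for the example):

```python
import hashlib

def shard_for(product_id, num_shards=8):
    """Pick a shard deterministically from the product key.

    md5 keeps the mapping stable across processes and restarts,
    unlike Python's built-in hash(), which is salted per process.
    """
    digest = hashlib.md5(product_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# A product document stored in a shape the UI can render directly,
# so reads need no joins or post-processing (fields illustrative).
doc = {
    "id": "ebay-12345",
    "title": "Red Shoes",
    "price": 49.99,
    "thumb_url": "https://img.example.com/ebay-12345.jpg",
}
print(shard_for(doc["id"]))
```

Note that plain modulo sharding forces a large reshuffle if you change `num_shards`; consistent hashing (or a store that shards for you, like DynamoDB or MongoDB) avoids that, which matters once you have tens of millions of documents.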
To handle the relationship between products and related data (comments, posts, etc.) you will need to associate them with whatever key is used to retrieve the products from the NoSQL store. If you are going to be dealing with many millions of product records, you will probably want to determine your data retention requirements: do you really need to keep details of a product that has been obsolete and/or unavailable for years?
If search is going to be the primary interface to the data, however, you may not need a NoSQL solution at all; simply pull what you need back from Solr.
You can put caching in front of most of these layers.