
Setting architecture for a mobile application with external APIs and smart content suggestions

Some of my colleagues and I have started working on an iPhone application that provides a social buying experience for the user. The goal is to give the user extended search capabilities (full-text, fuzzy search, filter-based, etc.) over millions of products that are constantly fetched from several product-listing APIs (such as eBay and Amazon), then normalized (i.e. transformation of fields, categories and relations) and enriched with business logic, so that users get customized content based on several criteria (a unique profile, i.e. age/gender, search history, what my friends bought, etc.). The application also has social features such as posts, likes and reviews about the products, following other users, etc.

So now we are trying to design the server architecture that will support these needs. Among other things, there are performance considerations ("GIVE ME all the products THAT match my search word AND ORDER them by relevancy" should run pretty fast, ~1 to 10 seconds) and scalability considerations (10 concurrent users should get results in the same amount of time as 100,000 users, provided I can throw more machines at the problem).

We assume we will have tens to hundreds of millions of products.

What we had in mind is (based on AWS):

  1. Set up Elastic Beanstalk to support scalability by adding more EC2 instances whenever traffic grows and taking them down when it diminishes
  2. Set up RDS with MySQL as the RDBMS for the application (managing users, profiles, normalized products etc.) across several availability zones
  3. Set up a background "agent" process on a different server to constantly fetch product data from the APIs (with a customizable fetch queue)
  4. Store the above "raw data" in some NoSQL store as temporary data
  5. Set up another "agent" to normalize the data, profile it and insert it into the RDBMS in a way that enables very quick searches already tailored to distinct user profiles
  6. Set up a caching mechanism to reduce load on the RDBMS
  7. Set up a good full-text search engine (e.g. Lucene)
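Steps 3-5 can be sketched as a two-stage pipeline. This is a minimal, language-neutral illustration, not a real API client: the fetch agent, the field mapping, and the two in-memory dicts (standing in for the NoSQL temp store and the RDBMS) are all hypothetical.

```python
# Sketch of steps 3-5: fetch raw listings, stage them in a temp store,
# then normalize them into one schema. All names here are stand-ins.
raw_store = {}         # stands in for the NoSQL temporary store
normalized_store = {}  # stands in for the RDBMS products table

def fetch_agent(source, listings):
    """Background fetch agent: stage raw listings as-is, keyed by source."""
    for item in listings:
        raw_store[f"{source}:{item['itemId']}"] = item

def normalize_agent():
    """Normalization agent: map source-specific fields to one schema."""
    for key, raw in raw_store.items():
        normalized_store[key] = {
            "title": raw.get("title") or raw.get("name"),
            "price_cents": round(float(raw.get("price", 0)) * 100),
            "category": raw.get("category", "uncategorized"),
        }

fetch_agent("ebay", [{"itemId": "1", "title": "Camera", "price": "49.99"}])
normalize_agent()
print(normalized_store["ebay:1"]["price_cents"])  # prints 4999
```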

Our main considerations are:

  1. Linux environment
  2. Mainly PHP and MySQL
  3. Performance is an issue
  4. Scalability will become an issue in the near future (6-12 months) (hopefully :))

Now several questions:

  1. Does the architecture make sense?
  2. Regarding data storage: is an RDBMS the right choice, or should we consider a NoSQL engine (e.g. MongoDB)?
  3. What techniques/approaches should we consider when tackling this problem?

By the way, war stories would be much appreciated :)

Asked Nov 02 '22 by Gregra

1 Answer

I think for what you've described, you will probably want to avoid Elastic Beanstalk and deploy directly onto EC2 instances that you control.

The front end will serve the web load and mostly query from cache. It can sit behind an Elastic Load Balancer, and you can use autoscaling rules to ensure you always have enough resources to handle the load.
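As a sketch of such an autoscaling rule, a target-tracking policy can be attached to an existing Auto Scaling group with the AWS CLI. The group name `app-asg` is a hypothetical example; adjust the CPU target to your own load profile.

```shell
# Assumes an Auto Scaling group named "app-asg" already exists behind the
# load balancer. Scales the group to keep average CPU near 60%.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name app-asg \
  --policy-name scale-on-cpu \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 60.0
  }'
```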

I would probably look at Solr for full-text search, but I'm not an expert in this area. I think Solr offers the scalability, replication, etc. to make your search infrastructure easier to manage. There are some good AWS Solr reference architectures that are designed to scale.
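The "match my search word and order by relevancy" query above maps naturally onto Solr's select endpoint. This sketch only builds the query URL; the host and the core name `products` are assumptions, and the field names are placeholders for whatever your schema defines.

```python
from urllib.parse import urlencode

# Hypothetical Solr host and core; substitute your own.
SOLR_BASE = "http://localhost:8983/solr/products/select"

def build_search_url(term, rows=20):
    """Build a Solr select URL matching the term against product fields
    and ordering results by relevancy score."""
    params = {
        "q": f"name:({term}) OR description:({term})",
        "sort": "score desc",   # relevancy ordering
        "rows": rows,
        "wt": "json",
    }
    return SOLR_BASE + "?" + urlencode(params)

print(build_search_url("vintage camera"))
```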

It sounds like you will need a couple of back-end service layers: one to pull in the data, another to normalize it. If you are going to commit to AWS, you can probably build these so that a central control process doles out work to instances you get via the spot market, which can help reduce overall costs. If the spot market spikes, you can choose either to slow down importing/processing or to use on-demand instances and increase costs a bit.
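The control-process/worker split can be sketched with a shared work queue. Here threads stand in for spot instances; in practice the queue would be something durable like SQS, and the job strings are purely illustrative.

```python
import queue
import threading

def controller(jobs, work_q):
    """Central control process: enqueue fetch/normalize jobs for workers."""
    for job in jobs:
        work_q.put(job)

def worker(work_q, results):
    """A worker (standing in for a spot instance) drains the queue."""
    while True:
        try:
            job = work_q.get_nowait()
        except queue.Empty:
            return
        results.append(f"processed:{job}")

work_q = queue.Queue()
results = []
controller(["ebay:page1", "amazon:page1"], work_q)
workers = [threading.Thread(target=worker, args=(work_q, results))
           for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(results))
```

If spot capacity disappears, jobs simply stay in the queue until a worker returns, which is what makes the slow-down option above cheap to implement.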

I'd probably design this to use a combination of MySQL and a NoSQL store: MySQL for core functionality (accounts, user preferences, etc.), but NoSQL for the product information. You probably want to store that in a format that can be used directly by the UI with minimal processing. Properly designed, this should allow sharding of the NoSQL store, which will help scalability, although you'll need a way to reproduce data if a node goes down.
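A minimal sketch of that split, assuming a hash-based shard mapping and a UI-ready JSON document per product (both the shard count and the document fields are illustrative; consistent hashing would ease resharding, but a modulo shows the idea):

```python
import hashlib
import json

NUM_SHARDS = 4  # illustrative shard count

def shard_for(product_id):
    """Deterministically map a product key to a NoSQL shard."""
    digest = hashlib.md5(product_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def to_document(product_id, title, price_cents):
    """Store the product as a JSON document the UI can render directly,
    keyed by the same product_id used for shard routing."""
    return json.dumps(
        {"id": product_id, "title": title, "price_cents": price_cents}
    )

doc = to_document("ebay:12345", "Camera", 4999)
print(shard_for("ebay:12345"), doc)
```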

To handle the relationship between products and related data (comments, posts, etc.), you will need to associate them with whatever key is used to retrieve the products from the NoSQL store. If you are going to be dealing with millions and millions of product records, you will probably want to determine your data retention requirements: do you really need to keep details of a product that has been obsolete and/or unavailable for years?

If search is going to be the primary interface to the data, however, you may not need a NoSQL solution at all; simply pull what you need back from Solr.

You can put caching in front of most of these layers.

Answered Nov 15 '22 by chris