I need to build a simple analytics back-end for capturing user behaviour. The data will be captured via a JavaScript snippet on a webpage, much like Google Analytics or Mixpanel data.
The system needs to capture close-to-realtime browser data (scroll position of the page, mouse position, etc.). It will record the state of the user's page every 5 seconds. There are only three attributes in each measurement, but they have to be taken frequently.
The data doesn't necessarily need to be sent every 5 seconds; it could be bussed up less frequently. However, it's imperative that I get all of the data while the user is on the page, i.e. I can't bus it once per minute and lose the last 59 seconds of data for someone who leaves after 119 seconds.
If possible, I'd like to build a system that will scale for the foreseeable future, which means it should work for 10,000 sites, each with 100 concurrent visitors, i.e. 100,000 concurrent users each sending one event every 5 seconds.
I'm not worried about querying the data; that can be done using a separate system. I'm most interested in how to handle the capture of the data itself.
Requirements
Based on the budgeting above, the system needs to handle 20,000 events per second coming from a pool of 100,000 users.
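As a quick sanity check of that figure, here is a small sketch of the arithmetic. The batch size of 6 samples per request is an assumed value for illustration only (it corresponds to one request every 30 seconds per user); it is not part of the requirements.

```javascript
// Events generated per second across all users.
function eventsPerSecond(concurrentUsers, sampleIntervalSeconds) {
  return concurrentUsers / sampleIntervalSeconds;
}

// HTTP requests per second if each request carries several samples.
function requestsPerSecond(concurrentUsers, sampleIntervalSeconds, samplesPerRequest) {
  return eventsPerSecond(concurrentUsers, sampleIntervalSeconds) / samplesPerRequest;
}

console.log(eventsPerSecond(100000, 5)); // 20000

// Assumed batch size of 6 samples per request (one request every 30 s per user).
console.log(Math.round(requestsPerSecond(100000, 5, 6))); // 3333
```

The number of samples stored stays the same, but batching would cut the number of requests the servers have to accept.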
I'd like to host this service on Heroku. However, while I've done a lot of work with Rails, I'm completely new to high-throughput systems (other than knowing you don't process them using Rails).
Questions
- Is there a commercial system that would be good for doing this (like Pusher but for data capture as well as distribution)?
- Should I be looking to do this using HTTP requests or websockets?
- Is Node.js the right choice for this or just trendy?
- If I were to choose a socket-based solution, how many sockets can a dyno on Heroku handle for each webserver?
- What are the pertinent considerations for choosing between Mongo, Redis, etc. for storage?
- Is this the type of problem which actually requires two solutions - the first to get you to reasonable scale quickly and inexpensively and the second to take you past that scale on lower incremental cost but with more development effort required upfront?
My high-level comment for you is to build your system following the 12 Factor design principles, and then worry about scaling as the customers arrive. I'm thrilled with Node.js and the npm ecosystem, but I also think you could build a perfectly acceptable platform with Rails. If it took 3 dynos to support 100 K concurrent users with Node, and double that with Rails, you still might be better off with Rails if your comfort with Ruby got you to market 3 months faster. Anyway, assuming you go with Node, here are my answers:
- Here are some alternatives to Pusher that might work for you and a discussion of Pusher vs. Pubnub. Also see Ably.
- Use socket.io. It's largely the standard, because it uses the best transport available and falls back from WebSockets to HTTP methods.
- Node is a fantastic choice and is also trendy (see the module growth rate). I suspect you could make your system work fine in Node, Rails or several other frameworks.
- A Heroku dyno should be able to support tens of thousands of concurrent connections, depending on how efficient you are with RAM. A server with 16 GB of RAM was able to support a million concurrent connections. Assuming you're RAM-limited, a Heroku dyno with 512 MB of RAM should be able to support ~30 K connections.
- You likely want to pick two different systems: one for storage and processing of your data, and one for caching. Here's a great post about picking your core data platform from the creator of Instagram. For core data, I recommend Postgres (on Heroku) using the Sequelize ORM, but Mongo with Solr for search would probably work fine too. Note that Postgres 9.2 can be used as a NoSQL datastore if that's the way you want to go. For a caching system, I highly recommend Redis.
- No, I would try to avoid throwaway engineering. Instead, build something that works, and expect that every time you reach an order of magnitude more traffic, some piece of the system will break and need to be replaced. But if you follow the 12 Factor principles, you should be in good shape to scale horizontally while you're investing in the replacement.
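The connection estimate in the fourth point is a simple proportion, sketched below. It assumes memory is the only limit and that per-connection memory on a dyno matches the 16 GB reference server, so treat the result as a rough upper bound rather than a measurement.

```javascript
// Scale a known data point (connections held at a given RAM size)
// to a machine with a different amount of RAM, assuming RAM is the only limit.
function estimateConnections(referenceConnections, referenceRamMb, targetRamMb) {
  return Math.floor(referenceConnections * (targetRamMb / referenceRamMb));
}

// 1,000,000 connections on 16 GB, scaled down to a 512 MB dyno.
const perDyno = estimateConnections(1000000, 16 * 1024, 512);
console.log(perDyno); // 31250

// Dynos needed for 100,000 concurrent users under the same assumption.
const dynosNeeded = Math.ceil(100000 / perDyno);
console.log(dynosNeeded); // 4
```

CPU, per-connection buffers and your own application state will all lower the real figure, so load test before relying on it.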
Good luck.
- There are many services for sockets, but Pusher and Pubnub seem to be the market leaders in this space. Whatever you do, don't host your own with something like socket.io, because Heroku times out requests longer than 30 seconds, including websockets. So a self-hosted socket is definitely out of the question unless you plan on closing and re-opening the socket every few seconds.
- If you were to use a socket service like Pusher, you would need to implement an HTTP endpoint for the service to send you the data anyway. So I would just cut out the middle man and go with a direct HTTP request. Granted, you need to collect constant user interactions, but that can all be recorded on the JavaScript client and sent back to the app periodically through CORS XHR or a tracking image.
- Node is a great choice. It's light, pretty easy to set up, and the npm libraries available will have everything you need to get started. Rails can be pretty swift too, especially if you cut out the things you don't need. There is a great Railscast on this subject. The important thing is to keep it as simple as possible. Maybe split it into two applications: one for collecting data, the other for analysing and processing it. This way you could collect the data in Node because it's fast, and analyse and process it in Rails because it's easy.
- As I mentioned in the first point, sockets just aren't going to work on Heroku, and even if you used Pusher you would still have to support the same number of HTTP requests, because when Pusher receives the data it's going to send it straight on to you. As for how many dynos you will need, this is something that can easily be tested but not something I can estimate. It will depend entirely on the efficiency of the code collecting the data. A simple ApacheBench (ab) test with the load and concurrency you are expecting will give you a good indication of what you will need. Node comes with its own concurrency, but if you were to use Rails to collect the data, then use Unicorn or Puma as your server because they support concurrency. Also try different configurations when testing with ab; Heroku now provides 2x dynos, which have 1024 MB of RAM instead of 512 MB and will allow you more concurrency.
- This Stack Overflow thread suggests Redis is faster, and faster is what you're going to want for collecting the data. After collecting it, though, you'll probably want to process it and store it in more than a key-value store. Mongo is a good option for that, but I would go with a graph database like Neo4j because of the intricate connections analytics data has.
- If you're entering new ground here, then you are not going to get it right the first time. You will find yourself iterating over it to get the best performance and the most accurate data. Eventually you'll probably delete it and start again with a new architecture, and the cycle will continue. Keeping the data collection and the analysis separate means you can focus on getting each bit right separately.
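To illustrate the second point, here is a minimal sketch of recording samples on the client and sending them back in batches. The names `createEventBuffer`, `COLLECT_URL` and the batch size of 6 are hypothetical, and the endpoint URL is a placeholder. It uses `navigator.sendBeacon` rather than the XHR or tracking image mentioned above, since that API is intended for sending data as a page unloads; swap in whichever transport suits you.

```javascript
// Collect samples, send them in batches, and allow a final flush
// so the tail end of a visit is not lost.
function createEventBuffer({ maxBatchSize, send }) {
  let pending = [];

  function flush() {
    if (pending.length === 0) return; // nothing to send
    const batch = pending;
    pending = [];
    send(batch);
  }

  function record(sample) {
    pending.push(sample);
    if (pending.length >= maxBatchSize) flush();
  }

  return { record, flush };
}

// Browser wiring; skipped when the file is run outside a browser.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  const COLLECT_URL = "https://collector.example.com/events"; // hypothetical endpoint
  const buffer = createEventBuffer({
    maxBatchSize: 6, // assumed: 6 samples x 5 s = one request every 30 s
    send: (batch) => navigator.sendBeacon(COLLECT_URL, JSON.stringify(batch)),
  });

  let mouse = { x: 0, y: 0 };
  document.addEventListener("mousemove", (e) => {
    mouse = { x: e.clientX, y: e.clientY };
  });

  // Sample the page state every 5 seconds.
  setInterval(() => {
    buffer.record({ t: Date.now(), scrollY: window.scrollY, x: mouse.x, y: mouse.y });
  }, 5000);

  // Send whatever is still buffered when the visitor leaves the page.
  window.addEventListener("pagehide", () => buffer.flush());
}
```

Because the transport is passed in as `send`, the buffering logic can be tested without a browser, and the collector only sees one request per batch instead of one per sample.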
One additional point I would like to mention: use a CDN for distribution of the JavaScript client, or better yet, provide the full JS to serve from the page. Either way, load fast and load asynchronously. It sounds like a fun project. Good luck!
EDIT: In an alternate universe, where you do not have to use Heroku, websockets would be an awesome solution.
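For the asynchronous loading, a common pattern is to inject the script tag from a small inline snippet. The sketch below assumes a hypothetical CDN URL and a hypothetical helper name, `loadTrackerAsync`; it also assumes the page already contains at least one script element, which is true when the snippet itself is inline.

```javascript
// Inject the tracking script without blocking page rendering.
function loadTrackerAsync(doc, src) {
  const script = doc.createElement("script");
  script.async = true;
  script.src = src;
  // Insert before the first existing script element on the page.
  const first = doc.getElementsByTagName("script")[0];
  first.parentNode.insertBefore(script, first);
  return script;
}

// Only runs in a browser, where `document` exists.
if (typeof document !== "undefined") {
  loadTrackerAsync(document, "https://cdn.example.com/tracker.js"); // hypothetical URL
}
```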