We have an application that is not thread-safe. One of the libraries it uses takes a lock at the file-system level. Unfortunately, the locking doesn't work correctly: the library crashes and throws an error under any concurrent use. We also can't replace this library. To achieve concurrency, which is better: running 100 containers on one powerful machine, or splitting the work across 100 small machines?
Since we are using Amazon, I am thinking about 100 x t2.micro instances, each running one container, versus one c4.8xlarge machine with 100 Docker containers. We don't have any problem with memory. The tasks are CPU-bound, but light enough that a t2.micro instance can handle them as long as it processes only one at a time.
I got into a discussion with a colleague about which one is better. I prefer the 100 instances because I think the Docker isolation will be a significant overhead. It's like having only one resource that is split among 100 people who all need to use it. On the other side, my colleague makes a point which I think might be valid: creating a Linux namespace is lighter than starting a whole OS. So if we have 100 machines, we have 100 OSes, while with one big machine, we only have 1 OS.
The thing is, I don't know which one is correct. Could someone who has knowledge of this explain which one would be better and give me a concrete reason?
Since I realized I have just asked a bad question, I will try to add more information here. To make the question more precise: I am not really asking which one will be better in my specific use case, or which is cheaper. I am just curious which one will perform better in terms of CPU. Imagine we have a very big computational problem, and we have to do 100 of them. We want to parallelize them, but they are not thread-safe. Is it better to run them on 100 small machines or on 1 powerful machine with 100 containers? Which one will complete faster, and why?
If we have only 1 powerful machine, won't all these 100 containers be fighting for resources and slow down the overall process? And if it's 100 small machines, maybe the overall performance will be slower because of the OSes or other factors? In any case, I don't have any experience with this. Of course I could try it, but since it's not an ideal environment (there are a lot of factors), the result won't be authoritative anyway. I am looking for an answer from people who know how both things work at a low level and could argue which environment will complete the task faster.
The only "appropriate" answer to your question is: you have to test both options and find out which one is better. The reason for this is: you are running a very specific application, with a very specific workload and very specific requirements. Any recommendation without actual testing is a guess. Maybe an "educated guess", but not more than that.
That said, let me show you what I would consider when doing my analysis for such a scenario.
The docker overhead should be absolutely minimal. The tool "docker" itself is not doing anything -- it's just using regular Linux Kernel features to create an isolated environment for your application.
After the OS has booted up, it will consume some memory, true. But the CPU consumption by the OS itself should be negligible (even for very small instances). Since you mentioned that you don't have any problems with memory, it seems like we can assume that this "OS overhead" that your colleague mentioned would also probably be negligible.
If you consider the route "a lot of very small instances", you could also consider using the recently released t2.nano instance type. You need to test if it has enough resources to actually run your application.
If you consider the route "one single very large instance", you should also consider the c4.8xl instance. This should give you significantly more CPU power than the c3.8xl.
Cost analysis (prices in us-east-1):
Now let's analyze the amount of resources that you have on each setup. I'm focusing only on CPU and ignoring memory, since you mentioned that your application isn't memory hungry:
Finally, let's analyze the cost per resource
As you can see, larger instances typically provide higher compute density which is typically less expensive.
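The "cost per resource" comparison above can be sketched in a few lines. This is only an illustration: the two hourly prices are placeholders I am assuming for the example (they are not quoted in this answer and are not current AWS prices), and the vCPU count for the c4.8xl is likewise supplied by me. The 10% figure is the t2.micro baseline mentioned in this answer.

```python
def cost_per_vcpu_hour(price_per_hour, vcpus, usable_fraction=1.0):
    """Dollars per hour for one vCPU's worth of *sustained* compute.

    usable_fraction is the share of each vCPU you can use continuously
    (1.0 for a fixed-performance instance, the baseline for a T2).
    """
    return price_per_hour / (vcpus * usable_fraction)


# ASSUMED placeholder prices in $/h; substitute the real ones for your region.
T2_MICRO_PRICE = 0.013
C4_8XL_PRICE = 1.675

# t2.micro: 1 vCPU, but only 10% of it is sustainable (baseline).
t2_micro = cost_per_vcpu_hour(T2_MICRO_PRICE, vcpus=1, usable_fraction=0.10)
# c4.8xl: 36 vCPUs, all of them usable continuously.
c4_8xl = cost_per_vcpu_hour(C4_8XL_PRICE, vcpus=36)

ratio = t2_micro / c4_8xl
print(f"t2.micro: ${t2_micro:.4f} per sustained vCPU-hour")
print(f"c4.8xl:   ${c4_8xl:.4f} per sustained vCPU-hour")
print(f"ratio:    {ratio:.2f}x")
```

The point of normalizing by the sustainable fraction is that a cheap hourly price on a T2 buys only a slice of a vCPU once the initial credits are gone.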
You should also consider the fact that the T2 instances are "burstable", in that they can go beyond their baseline performance (10% and 5% as above) for some time, depending on how many "CPU credits" they have. However, you should know that, although they start with some credit balance, it's typically enough for booting up the OS and not much more than that. You'd accrue more CPU credits over time if you don't push your CPU beyond your baseline, but that won't be the case here, since we are optimizing for performance. And as we have seen, the "cost per resource" is nearly 3x that of the 8xl instances, so the short burst that you'll get probably wouldn't change this rough estimate.
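To get a feel for how short that burst is, here is a rough sketch of the credit arithmetic. It relies on the definition that one CPU credit pays for one vCPU running at 100% for one minute, and that an instance earns credits at its baseline rate. The starting balance of 30 credits is an assumption for illustration; check the balance your instances actually launch with.

```python
def full_burst_minutes(credit_balance, baseline_fraction):
    """Minutes a single vCPU can run at 100% before the credits run out.

    Running flat out burns 1 credit per minute, while the instance keeps
    earning `baseline_fraction` credits per minute, so the net drain is
    (1 - baseline_fraction) credits per minute.
    """
    net_burn_per_minute = 1.0 - baseline_fraction
    return credit_balance / net_burn_per_minute


ASSUMED_LAUNCH_CREDITS = 30  # hypothetical starting balance

for name, baseline in (("t2.micro", 0.10), ("t2.nano", 0.05)):
    minutes = full_burst_minutes(ASSUMED_LAUNCH_CREDITS, baseline)
    print(f"{name}: about {minutes:.1f} minutes at 100% CPU")
```

After that window, a CPU-bound job on a T2 is throttled back to the baseline, which is the figure the cost comparison should be based on.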
You might also want to consider network utilization. Is the application network intensive? Either in latency requirements, or bandwidth requirements, or in the amount of packets per second?
Now, what about resiliency? How time-sensitive are these jobs? What would be the "cost of not finishing them in a timely manner"? You might also want to consider some failure modes:
To reduce the effect of "one single instance dying" on your workload, and still get some benefits from higher density compute (i.e., large c3 or c4 instances), you could consider other options, such as 2x c4.4xl, or 4x c4.2xl, and so on. Notice that the c4.8xl costs twice as much as the c4.4xl but contains more than twice the number of vCPUs. So the analysis above wouldn't be "linear"; you'd need to recalculate some costs.
Assuming that you are OK with instances "failing" and your application can somehow deal with that -- another interesting point to consider is using Spot instances. With Spot instances, you name your price. If the "market price" (driven by supply and demand) is below your bid, you get the instance and pay only the "market price". If the price rises above your bid, your instances are terminated. It's not uncommon to see up to 90% discounts when compared to On Demand. As of right now, c3.8xl is approximately $0.28/h in one AZ (83% less than On Demand) and c4.8xl is about the same in one AZ (83% less as well). Spot pricing is not available for t2 instances.
You can also consider Spot Block, in which you specify the number of hours you want to run your instances for. You'll typically pay 30% - 45% less than On Demand, and there's no risk of being outbid during the period you specified. After the period, your instances are terminated.
I would also try to size my fleet of servers so that they are needed for nearly a "full number of hours" without exceeding it (unless I'm required to finish execution ASAP). That is, it's much better to have a smaller fleet that will finish the jobs in 50 minutes than a larger one capable of finishing them in 10 minutes. The reason is: you pay by the hour, at the beginning of the hour. Likewise, it's typically much better to have a larger fleet that will finish the jobs in 50 minutes than a smaller one that would require 1h05 -- again, because you pay by the hour, at the beginning of the hour.
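The effect of paying by the hour can be made concrete with a small sketch. It assumes the per-hour, round-up billing described above; the fleet sizes and the $1.00/h price are made-up numbers chosen only to make the arithmetic easy to follow.

```python
import math


def billed_cost(instances, runtime_minutes, price_per_hour):
    """Total cost when every started hour is billed in full."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return instances * billed_hours * price_per_hour


PRICE = 1.00  # hypothetical $/h per instance

# Hypothetical fleets doing the same total work at different speeds.
small_fleet_65_min = billed_cost(instances=8, runtime_minutes=65, price_per_hour=PRICE)
medium_fleet_50_min = billed_cost(instances=10, runtime_minutes=50, price_per_hour=PRICE)
large_fleet_10_min = billed_cost(instances=50, runtime_minutes=10, price_per_hour=PRICE)

print(f"8 instances, 65 min:  ${small_fleet_65_min:.2f}")
print(f"10 instances, 50 min: ${medium_fleet_50_min:.2f}")
print(f"50 instances, 10 min: ${large_fleet_10_min:.2f}")
```

The 65-minute fleet pays for two full hours per instance, and the 10-minute fleet pays for 50 mostly idle instance-hours, so the fleet that lands just under the hour boundary comes out cheapest.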
Finally, you mention that you are looking for "the best performance". What exactly do you mean by that? What is your Key Performance Indicator? Do you want to optimize to reduce the amount of time spent in total? Maybe reduce the amount of time "per unit"/"per job"? Are you trying to reduce your costs? Are you trying to be more "energy efficient" and reduce your carbon footprint? Or maybe optimize for the amount of maintenance required? Or focus on simplifying to reduce the ramp-up period that other colleagues less knowledgeable would require before they would be able to maintain the solution?
Maybe a combination of many of the performance indicators above? How would they combine? There's always a trade off...
As you can see, there isn't a clear winner, at least not without more specific information about your application and your optimization goals. That's why, most of the time, the best option for any kind of performance optimization is really testing it. Testing is also typically inexpensive: it would require maybe a couple of hours of work to set up your environment, and then likely less than $2/h of testing.
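As a starting point for such a test, here is a minimal timing harness. It is a sketch, not a benchmark suite: `time_batch` is a name I made up, and the job command is a stand-in for whatever launches one of your tasks (for example, the command that starts one of your containers). Run it on each candidate instance type with increasing concurrency and compare the wall-clock times.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_batch(command, jobs, concurrency):
    """Run `command` `jobs` times, at most `concurrency` at once.

    Returns (elapsed_seconds, list_of_return_codes). Each job is a separate
    OS process, so non-thread-safe jobs never share a process.
    """
    def run_one(_):
        return subprocess.run(command, check=False).returncode

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        codes = list(pool.map(run_one, range(jobs)))
    return time.perf_counter() - start, codes


if __name__ == "__main__":
    # Hypothetical stand-in for one CPU-bound job; replace with your own command.
    job = [sys.executable, "-c", "sum(i * i for i in range(200_000))"]
    for concurrency in (1, 2, 4):
        elapsed, codes = time_batch(job, jobs=4, concurrency=concurrency)
        failed = sum(1 for code in codes if code != 0)
        print(f"concurrency={concurrency}: {elapsed:.2f}s, failed jobs: {failed}")
```

If the total time stops improving as concurrency grows, the jobs are contending for CPU (or something else) on that machine, which is exactly the effect the question asks about.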
So, that should give you enough information to start your investigation.
Happy testing!