I'm building a program that will live on an AWS EC2 instance and (probably) be invoked periodically via a cron job. The program will 'crawl'/'poll' specific websites that we've partnered with, index/aggregate their content, and update our database. I'm thinking Java is a perfect fit as the language for this application. Some members of our engineering team are concerned about the performance cost of Java's garbage collection and are suggesting C++ instead.
Are these valid concerns? The application will be invoked perhaps once every 30 minutes via cron, and as long as it finishes its task within that window I would assume the performance is acceptable. I'm not sure garbage collection would be a performance issue, since the server should have plenty of memory, and the actual act of tracking how many objects point to an area of memory and then declaring that memory free when the count reaches zero doesn't seem too costly to me.
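For context, the invocation would just be a cron entry along these lines (the paths and jar name are placeholders, not our real setup):

    # run the crawler every 30 minutes and append its output to a log
    */30 * * * * /usr/bin/java -jar /opt/crawler/crawler.jar >> /var/log/crawler.log 2>&1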
No, your concerns are most likely unfounded.
GC can be a concern when dealing with large heaps and fragmented memory (which forces a stop-the-world collection), or with medium-lived objects that get promoted to the old generation but are then quickly dereferenced (which causes excessive GC, but can be fixed by resizing the ratio of new to old generation space).
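If you did hit that promotion problem, the generation sizing is just a couple of launch flags away; something like the following, where the sizes and jar name are purely illustrative:

    # give the young generation half of a 2 GB heap (illustrative numbers)
    java -Xms2g -Xmx2g -Xmn1g -jar crawler.jar

    # or express the same split as an old:new generation ratio
    java -Xms2g -Xmx2g -XX:NewRatio=1 -jar crawler.jar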
A web crawler is very unlikely to fit either of those profiles: you probably don't need a massive old generation, and your objects should be relatively short-lived (the in-memory representation of a page only needs to exist while you parse data out of it), so they will be dealt with efficiently by the young-generation collector.
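To make that allocation pattern concrete, here is a minimal sketch (JDK 11+, placeholder URLs, no real parsing or database writes): each page's representation is allocated, read, and abandoned within one loop iteration, which is exactly the kind of garbage the young-generation collector handles cheaply.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class CrawlSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical partner URLs; in the real job these would come from config or the database.
            List<String> partnerUrls = List.of("https://example.com/feed", "https://example.org/catalog");

            HttpClient client = HttpClient.newHttpClient();
            for (String url : partnerUrls) {
                // The response body is a short-lived object: it is allocated, inspected,
                // and becomes garbage before the next iteration.
                HttpResponse<String> response = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                int length = response.body().length();   // stand-in for real parsing/extraction
                System.out.println(url + " -> " + length + " chars fetched");
            }
        }
    }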
We have an in-house crawler (Java) that happily handles 2 million pages per day, including some additional post-processing per page, on commodity hardware (2 GB of RAM); the main constraint is bandwidth. GC is a non-issue.
As others have mentioned, GC is rarely an issue for throughput-sensitive applications (such as a crawler), but it can (if one is not careful) be an issue for latency-sensitive applications (such as a trading platform).
The typical concern C++ programmers have about GC is latency. That is, as you run a program, periodic GCs interrupt the mutator and cause spikes in latency. Back when I ran Java web applications for a living, I had a couple of customers who would see latency spikes in the logs and complain about them, and my job was to tune the GC to minimize the impact of those spikes. There have been some fairly sophisticated advances in GC over the years to make monstrous Java applications run with consistently low latency, and I'm impressed with the work of the engineers at Sun (now Oracle) who made that possible.
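If you ever want to see those pauses for yourself, GC logging makes them visible; the exact flag spelling depends on your JDK version, and the jar name here is just a placeholder:

    # older JVMs (before JDK 9)
    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar app.jar

    # JDK 9 and later (unified logging)
    java -Xlog:gc* -jar app.jar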
However, GC has always been very good at handling tasks with high throughput, where latency is not a concern. This includes cron jobs. Your engineers have unfounded concerns.
Note: A simple experimental GC reduced the cost of memory allocation/freeing to less than two instructions on average, which improved throughput, but this design is fairly esoteric and requires a lot of memory, which you don't have on EC2.
The simplest GCs around offer a tradeoff between large heap (high latency, high throughput) and small heap (lower latency, lower throughput). It takes some profiling to get it right for a particular application and workload, but these simple GCs are very forgiving in a large heap / high throughput / high latency configuration.
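In practice, for a batch job like yours, that usually just means giving it a generous fixed heap and the throughput (parallel) collector, then glancing at the GC log once to confirm; something along these lines, with the heap size and jar name as placeholders:

    # throughput-oriented configuration: large fixed heap, parallel collector
    java -Xms4g -Xmx4g -XX:+UseParallelGC -jar crawler.jar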