
New Relic for Amazon Kinesis worker monitoring

We're using Amazon Kinesis (a queue service) and have queue readers written in Java. They basically read from the queue and insert data into our datastore. I was wondering if anyone has had success in using New Relic to monitor background queue workers?

Some analytics I'm interested in:

  1. How many queue workers are running right now? (they scale up and down based on load)
  2. How many messages/second is each queue worker handling? How does this look over time?
  3. How many messages/second is the entire worker fleet handling?
  4. The workers make requests to both MySQL and Cassandra. What fraction of their time is spent doing this?
  5. We're logging with log4j. If the workers have generated errors/traces, what are they? What is the error rate over time?

Thanks,

Advait

advait asked Nov 01 '22 16:11



1 Answer

New Relic doesn't have any trouble monitoring batch jobs as opposed to web transactions, so that won't be an issue.

Assuming you're starting out with a Java app for which you have source code available, the best path forward is to use the agent API (https://docs.newrelic.com/docs/agents/java-agent/custom-instrumentation/java-agent-api). This leaves you in a good place to report any metrics you like, even ones we don't record automatically. I'll answer your questions one by one:

1) We have a couple of ways to slice this pie, but the easiest one I can think of is to make a NewRelic.recordMetric("Custom/Queue_worker/alive",1) call. I'd have a timer on each worker make that call once a minute (since that's our metric harvest cycle). Then, in a custom dashboard (https://docs.newrelic.com/docs/apm/dashboards-menu/custom-dashboards), you can ignore the metric values. They will be averaged, so reporting 1+1+1... from many workers just gives you 1; the value would only be useful if you had a master that "knows" the worker count and reports it directly. Instead, you'll be graphing the call_count field to see how many workers ran that minute.
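A minimal sketch of the heartbeat timer from point 1, not official agent code. WorkerHeartbeat is a hypothetical helper name, and the BiConsumer is a stand-in for NewRelic.recordMetric(String, float) so the example compiles without the agent jar; with the agent on the classpath, passing NewRelic::recordMetric should fit the same shape.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Hypothetical helper: reports one "alive" data point per minute per worker.
final class WorkerHeartbeat implements AutoCloseable {
    static final String METRIC = "Custom/Queue_worker/alive";

    // Stand-in for NewRelic.recordMetric(String, float) (assumption: you wire it in).
    private final BiConsumer<String, Float> recordMetric;

    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "worker-heartbeat");
                t.setDaemon(true); // don't keep the JVM alive just for heartbeats
                return t;
            });

    WorkerHeartbeat(BiConsumer<String, Float> recordMetric) {
        this.recordMetric = recordMetric;
    }

    // One call adds one to call_count; the value 1 itself is ignored on the dashboard.
    void beat() {
        recordMetric.accept(METRIC, 1f);
    }

    // Beat once per harvest cycle (one minute, per the answer above).
    void start() {
        timer.scheduleAtFixedRate(() -> {
            try {
                beat();
            } catch (RuntimeException e) {
                // Swallow: an uncaught exception would cancel all future beats.
            }
        }, 0, 1, TimeUnit.MINUTES);
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

Each worker would create one WorkerHeartbeat at startup, call start(), and close it on shutdown.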

2) In this case, you would want to use much the same pattern as above, except creating a different custom metric per worker: something like NewRelic.recordMetric("Custom/Queue_worker/y/number_of_messages",x), where x is the number of messages processed over a minute and y is some unique identifier (GUID? random value?) per worker. Fortunately, custom dashboards help out with the heavy lifting here: you can just graph Custom/Queue_worker/*/number_of_messages to get all the workers laid out on the same graph.
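Point 2 could look roughly like the sketch below. PerWorkerThroughput is a hypothetical class name, the random UUID is just one possible choice for y, and the BiConsumer again stands in for NewRelic.recordMetric(String, float) so the code runs without the agent jar.

```java
import java.util.UUID;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.BiConsumer;

// Hypothetical helper: counts messages locally, reports the count under a per-worker metric.
final class PerWorkerThroughput {
    private final String metricName;
    private final LongAdder processed = new LongAdder();

    // Stand-in for NewRelic.recordMetric(String, float) (assumption).
    private final BiConsumer<String, Float> recordMetric;

    PerWorkerThroughput(String workerId, BiConsumer<String, Float> recordMetric) {
        // "y" in the answer's metric name is the worker's unique identifier.
        this.metricName = "Custom/Queue_worker/" + workerId + "/number_of_messages";
        this.recordMetric = recordMetric;
    }

    // Convenience factory using a random UUID as the worker identifier.
    static PerWorkerThroughput withRandomId(BiConsumer<String, Float> recordMetric) {
        return new PerWorkerThroughput(UUID.randomUUID().toString(), recordMetric);
    }

    // Call once for every message the worker finishes.
    void messageProcessed() {
        processed.increment();
    }

    // Call once a minute: reports "x" (messages since the last flush) and resets the counter.
    void flush() {
        long count = processed.sumThenReset();
        recordMetric.accept(metricName, (float) count);
    }

    String metricName() {
        return metricName;
    }
}
```

Because each worker writes to its own metric name and reports once per minute, the averaging doesn't distort the value here. The flush() call could be driven by the same once-a-minute timer used for the heartbeat.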

3) Have each worker submit the same custom metric, Custom/queue_worker/message_sent, once per message, and graph call count on that metric. Once again, you can't just report a value for each worker, since the subsequent metric data will be averaged together, but we will keep a good call count for you.
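To make the "averaged value vs. call count" distinction in points 1 and 3 concrete, here is a toy model. ToyTimeslice is purely illustrative and is an assumption about the behavior described above (values averaged, calls counted); it is not the agent's real data structure.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one harvest cycle: per metric name, keep a call count and a running total.
final class ToyTimeslice {
    private static final class Stats {
        long callCount;
        double total;
    }

    private final Map<String, Stats> byName = new HashMap<>();

    // Mirrors the shape of NewRelic.recordMetric(String, float), but only records locally.
    void recordMetric(String name, float value) {
        Stats s = byName.computeIfAbsent(name, k -> new Stats());
        s.callCount++;
        s.total += value;
    }

    long callCount(String name) {
        Stats s = byName.get(name);
        return s == null ? 0 : s.callCount;
    }

    double average(String name) {
        Stats s = byName.get(name);
        return (s == null || s.callCount == 0) ? 0.0 : s.total / s.callCount;
    }

    public static void main(String[] args) {
        ToyTimeslice minute = new ToyTimeslice();
        int[] messagesPerWorker = {4, 5, 6}; // three workers in the same minute
        for (int messages : messagesPerWorker) {
            for (int i = 0; i < messages; i++) {
                // Every worker reports the same metric name, once per message.
                minute.recordMetric("Custom/queue_worker/message_sent", 1f);
            }
        }
        System.out.println("call_count=" + minute.callCount("Custom/queue_worker/message_sent")
                + " average=" + minute.average("Custom/queue_worker/message_sent"));
    }
}
```

With three workers handling 4, 5 and 6 messages, the averaged value stays at 1.0 while the call count is 15, which is the fleet-wide number you want to graph.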

4) You'll get MySQL time for free (as long as you're using the MySQL or JDBC connector listed here: https://docs.newrelic.com/docs/agents/java-agent/getting-started/new-relic-java#h2-compatibility). It will show up as 'database' time in your graphs and transaction traces. For Cassandra, we have no specific instrumentation, but you can use the agent API once again (NewRelic.recordResponseTimeMetric() recommended) to at least record that time and graph it separately.
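For the Cassandra part of point 4, a timing wrapper along these lines is one option. TimedCall and the metric name Custom/Cassandra/insert are hypothetical, and the ObjLongConsumer stands in for NewRelic.recordResponseTimeMetric(String, long) so the sketch compiles without the agent jar.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.ObjLongConsumer;
import java.util.function.Supplier;

// Hypothetical helper: measures how long a call takes and reports it in milliseconds.
final class TimedCall {
    // Stand-in for NewRelic.recordResponseTimeMetric(String, long) (assumption).
    private final ObjLongConsumer<String> recordResponseTime;

    TimedCall(ObjLongConsumer<String> recordResponseTime) {
        this.recordResponseTime = recordResponseTime;
    }

    // Runs the call and records its duration, even when the call throws.
    <T> T time(String metricName, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            recordResponseTime.accept(metricName, millis);
        }
    }
}
```

A worker would wrap each Cassandra call, for example timed.time("Custom/Cassandra/insert", () -> runInsert(row)), where runInsert is whatever method already performs the write.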

5) You get error rate for free, as long as your errors consist of unhandled exceptions - or you can make an API call anytime you're handling an exception (or any error condition you want to tag) to NewRelic.noticeError(). Further, if the errors come in as unhandled exceptions (neat trick: handle your exception in your code, then rethrow it so our agent sees it with context), you'll get a stack trace and any metadata about the transaction that you've recorded with NewRelic.addCustomParameter(). We don't do logfile processing, though you could write a very small program to do that processing and import the metrics using the above methods, and since we license per running host, not per agent, you could run that on an already-licensed worker for no additional cost.
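For the "handled exception" path in point 5, something like the following sketch could work. ErrorReportingHandler is a hypothetical wrapper, and the Consumer<Throwable> stands in for NewRelic.noticeError(Throwable) so the example runs without the agent jar.

```java
import java.util.function.Consumer;

// Hypothetical wrapper: processes a message and reports any failure it handles itself.
final class ErrorReportingHandler<M> {
    private final Consumer<M> delegate;

    // Stand-in for NewRelic.noticeError(Throwable) (assumption).
    private final Consumer<Throwable> noticeError;

    ErrorReportingHandler(Consumer<M> delegate, Consumer<Throwable> noticeError) {
        this.delegate = delegate;
        this.noticeError = noticeError;
    }

    // Returns true on success; on failure, reports the exception and keeps the worker alive.
    boolean handle(M message) {
        try {
            delegate.accept(message);
            return true;
        } catch (RuntimeException e) {
            noticeError.accept(e); // the error is handled here, so report it explicitly
            return false;
        }
    }
}
```

This variant swallows the exception after reporting it so the worker loop keeps running. If you would rather rely on the rethrow trick described above, rethrow instead of calling the reporter.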

There are much easier ways to do this using Insights (https://docs.newrelic.com/docs/insights/new-relic-insights). For instance, you can access the list of running agents over time without any additional work, and you can report numbers that won't be averaged, then do math on them and graph the results. But that's a separate product and I'm not trying to upsell you :)

note: I work for New Relic.

fool answered Nov 15 '22 06:11
