We have a large number of applications distributed across many machines in multiple datacentres.
Throughout the day, we'll receive signals (either internal or external), which cause a cascade of events throughout each application.
Each signal thus produces a huge amount of event log data. The loglines themselves aren't particularly structured, and they also differ quite a bit between applications. They do follow this basic convention, though:
<timestamp> <calling function/method> <payload>
We have ID numbers in loglines that can help link events back to a signal - however, these aren't foolproof, and we sometimes need other ways to piece events together.
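To give a concrete (entirely made-up) example - the application, method and ID convention below are illustrative only - a line and a first stab at parsing it might look like:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogLineParser {
        // <timestamp> <calling function/method> <payload>
        private static final Pattern LINE =
            Pattern.compile("^(\\S+ \\S+) (\\S+) (.*)$");
        // IDs are embedded somewhere in the payload, e.g. "txid=12345" (made-up convention)
        private static final Pattern TX_ID = Pattern.compile("txid=(\\d+)");

        public static void main(String[] args) {
            String line = "2013-04-02 12:00:01,123 OrderService.processOrder txid=12345 payment accepted";
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                System.out.println("timestamp = " + m.group(1));
                System.out.println("caller    = " + m.group(2));
                System.out.println("payload   = " + m.group(3));
                Matcher id = TX_ID.matcher(m.group(3));
                if (id.find()) {
                    System.out.println("signal id = " + id.group(1));
                }
            }
        }
    }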
I've been reading up about Twitter's Storm system and I'm quite interested in trying it out to analyse this mass of log data in realtime, and piece it together.
I'd like to do things like:

- Produce reports and streaming graphs based on trends from the data in realtime
- Query a signal, then bring up the entire chain of events related to that signal in all applications, including delays between steps in the chain. (This is important.)
- View correlated events, and drill into what else an application was doing around the time of a certain event.
The log data is stored in local logfiles (and this is unlikely to change), so we'd need a way to slurp the data into Storm itself. Logfiles may also be compressed. I've thought about using Flume or Logstash - what are people's thoughts on these? Or are there alternative ways that would work well with Storm?
I also need a way to store both the data for the live reports and graphs and the event data itself.
It's the second part I'm finding a bit tricky - what sort of storage backends are suitable for storing events, as well as the links between them? Would some kind of graph database be suitable, one of those new-fangled schemaless NoSQL ones, or something a bit more traditional?
Finally, is Storm suitable for this role, or is something else a better fit?
And if I do go with Storm, what sort of approach can I take to tackle this? I'm hoping other people have experience with similar problem sets.
Cheers, Victor
One of the important properties of Storm is its simple programming model: in the same way that MapReduce lowers the complexity of doing parallel batch processing, Storm lowers the complexity of doing real-time processing.
Produce reports and streaming graphs based on trends from the data in realtime
This one sounds like an excellent fit.
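As a minimal sketch of what that bolt could look like - assuming the classic backtype.storm packages (newer releases use org.apache.storm) and an upstream parsing bolt that emits an "application" field, both of which are assumptions about your setup - it would keep running counts per application and push updated counts downstream for a dashboard feed:

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Keeps a running count of events per application and emits the updated
    // count downstream, where another bolt (or a dashboard feed) can pick it up.
    public class EventCountBolt extends BaseRichBolt {
        private OutputCollector collector;
        private Map<String, Long> counts;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.counts = new HashMap<String, Long>();
        }

        @Override
        public void execute(Tuple tuple) {
            // Assumes an upstream parsing bolt emits a tuple with an "application" field.
            String app = tuple.getStringByField("application");
            Long count = counts.get(app);
            count = (count == null) ? 1L : count + 1;
            counts.put(app, count);
            collector.emit(tuple, new Values(app, count));
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("application", "count"));
        }
    }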
Query a signal, then bring up the entire chain of events related to that signal in all applications, including delays between steps in the chain. (This is important).
If your query is confined to recent data (i.e. not a lot of data) and you can tolerate data loss, I can imagine doing this using only Storm. If not, I would combine Storm with a database and use Storm mainly for preprocessing and storing the data in the database. The query is probably better handled by the database in this case.
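If you go that route, the write path from Storm is just another bolt. A rough sketch using plain JDBC - the connection string, table and field names are made up for illustration, and a real bolt would batch writes and handle reconnects:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Persists each parsed event so that ad-hoc queries (chains of events,
    // delays between steps) can be answered from the database later.
    public class EventPersistenceBolt extends BaseRichBolt {
        private OutputCollector collector;
        private Connection connection;
        private PreparedStatement insert;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            try {
                // Connection details and schema are illustrative only.
                connection = DriverManager.getConnection("jdbc:postgresql://dbhost/events", "user", "secret");
                insert = connection.prepareStatement(
                    "INSERT INTO events (signal_id, application, caller, event_time, payload) VALUES (?, ?, ?, ?, ?)");
            } catch (Exception e) {
                throw new RuntimeException("Could not open database connection", e);
            }
        }

        @Override
        public void execute(Tuple tuple) {
            try {
                insert.setString(1, tuple.getStringByField("signal_id"));
                insert.setString(2, tuple.getStringByField("application"));
                insert.setString(3, tuple.getStringByField("caller"));
                insert.setLong(4, tuple.getLongByField("timestamp"));
                insert.setString(5, tuple.getStringByField("payload"));
                insert.executeUpdate();
                collector.ack(tuple);
            } catch (Exception e) {
                collector.fail(tuple);  // let Storm replay the tuple
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, nothing emitted downstream
        }
    }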
View correlated events, and drill into what else an application was doing around the time of a certain event.
Storm is great when you know what query you'll be performing and you don't need access to a lot of data for the queries. For example, serving up a feed that shows correlated events would be a great fit. Providing a means to perform ad-hoc queries (drill-down) would probably be easier with a database. Also, if you want to allow a user to query a large amount of data (e.g. a week's worth of data as opposed to an hour's worth), then you will probably need a database.
As for feeding the data in, I would use a log centralisation product. You can create a Spout that interacts with whatever interface that product provides. Alternatively, if you are using a logging framework that can send logs via sockets, JMS, etc. (like log4j), you could have a spout reading from that socket or JMS queue.
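For example, a bare-bones spout reading newline-delimited lines from a TCP socket might look roughly like this (host, port and field name are placeholders, and there's no reconnect logic):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.util.Map;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    // Reads newline-delimited log lines from a TCP socket and emits one tuple per line.
    public class SocketLogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private BufferedReader reader;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            try {
                // Host/port are illustrative; point this at your log forwarder.
                Socket socket = new Socket("loghost", 5140);
                reader = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            } catch (Exception e) {
                throw new RuntimeException("Could not connect to log source", e);
            }
        }

        @Override
        public void nextTuple() {
            try {
                // readLine() blocks; a production spout would fill a queue from a
                // background thread and keep nextTuple() non-blocking.
                String line = reader.readLine();
                if (line != null) {
                    collector.emit(new Values(line));
                }
            } catch (Exception e) {
                // In a real spout you would reconnect/back off here.
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("logline"));
        }
    }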
As for DB choices, it really depends on what you want to do. If you don't know in advance what kind of activity you'll be logging and you want to correlate events, my bet would be on a graph database, as traversing events would be easy.
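To illustrate why the graph model fits, independent of any particular product: if each event is a node and "caused by" links are edges, pulling up the whole chain of events behind a signal is a simple traversal. A toy in-memory sketch (event names are made up):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy in-memory model of the event graph; a graph database gives you the
    // same traversal (plus persistence and indexing) out of the box.
    public class EventChain {
        static Map<String, List<String>> causedBy = new HashMap<String, List<String>>();

        static void link(String parentEvent, String childEvent) {
            List<String> children = causedBy.get(parentEvent);
            if (children == null) {
                children = new ArrayList<String>();
                causedBy.put(parentEvent, children);
            }
            children.add(childEvent);
        }

        // Depth-first walk from the event that the signal triggered.
        static List<String> chainFor(String rootEvent) {
            List<String> chain = new ArrayList<String>();
            Deque<String> stack = new ArrayDeque<String>();
            stack.push(rootEvent);
            while (!stack.isEmpty()) {
                String event = stack.pop();
                chain.add(event);
                List<String> children = causedBy.get(event);
                if (children != null) {
                    for (String child : children) {
                        stack.push(child);
                    }
                }
            }
            return chain;
        }

        public static void main(String[] args) {
            // Made-up events from two applications handling the same signal.
            link("signal-42:gateway.receive", "appA:OrderService.process");
            link("appA:OrderService.process", "appB:Billing.charge");
            link("appB:Billing.charge", "appB:Billing.confirm");
            System.out.println(chainFor("signal-42:gateway.receive"));
        }
    }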
This sounds a lot like the case I'm working on at the moment, so I'll give a few ideas of what would be possible to do.
For getting the data in, you can take a look at Apache Kafka. This messaging system can get your logs off the applications and into intermediary storage. From there, different systems can attach as consumers, Storm being one of them; it integrates well using the dedicated Storm-Kafka spout.
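Wiring that up is mostly configuration. A rough sketch using the storm-kafka module (the ZooKeeper addresses, topic name and print bolt are placeholders, and package names differ between Storm releases, so treat this as a shape rather than exact code):

    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class LogTopology {

        // Stand-in for your real parsing/correlation bolts: just prints each log line.
        public static class PrintBolt extends BaseRichBolt {
            private OutputCollector collector;

            @Override
            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void execute(Tuple tuple) {
                System.out.println(tuple.getStringByField("str"));  // StringScheme emits a "str" field
                collector.ack(tuple);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
            }
        }

        public static void main(String[] args) {
            // ZooKeeper ensemble used by the Kafka brokers, and the topic the logs are published to.
            BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181");
            SpoutConfig spoutConfig = new SpoutConfig(hosts, "app-logs", "/kafka-storm", "log-reader");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());  // deserialise messages as UTF-8 strings

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 4);
            builder.setBolt("print", new PrintBolt(), 2).shuffleGrouping("kafka-spout");

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("log-analysis", new Config(), builder.createTopology());
        }
    }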
In our case we have some real-time data consumed directly from the Kafka brokers into monitoring/dashboards, and other data streams that need processing through Storm. The latter are stored in a distributed DB (MongoDB, Cassandra or Couchbase, depending on the nature of the data), which is then loaded into dashboards and other systems.
For batch jobs, you can also load data from Kafka into Hadoop, and all of this can be done independently, pulling the same data from Kafka into multiple systems.
Kafka also supports multiple data centers through mirror-maker.