 

Using Twitter Storm to process log data?

We have a large number of applications distributed across many machines in multiple datacentres.

Throughout the day, we'll receive signals (either internal or external), which cause a cascade of events throughout each application.

Each signal thus produces a huge amount of event log data. The loglines themselves aren't particularly structured, and they're also quite different between applications. They do follow this basic convention, though:

<timestamp> <calling function/method> <payload>

We have ID numbers in loglines that can help link together events to a signal - however, these aren't foolproof, and we sometimes need to use other ways to try to piece events together.
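
For illustration, parsing one of these lines might look something like the sketch below. The regex and the "id=" convention are placeholders, since the real formats vary between applications.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogLineParser {

        // Placeholder pattern for "<timestamp> <calling function/method> <payload>".
        // Real lines differ between applications, so each app would need its own pattern.
        private static final Pattern LINE =
                Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(.*)$");

        // Placeholder: assume a correlation ID appears somewhere in the payload as "id=12345".
        private static final Pattern SIGNAL_ID = Pattern.compile("id=(\\d+)");

        public static ParsedLine parse(String line) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                return null; // unparseable line; would need app-specific handling
            }
            Matcher id = SIGNAL_ID.matcher(m.group(3));
            String signalId = id.find() ? id.group(1) : null;
            return new ParsedLine(m.group(1), m.group(2), m.group(3), signalId);
        }

        public static class ParsedLine {
            public final String timestamp;
            public final String caller;
            public final String payload;
            public final String signalId; // may be null, hence the need for other correlation heuristics

            public ParsedLine(String timestamp, String caller, String payload, String signalId) {
                this.timestamp = timestamp;
                this.caller = caller;
                this.payload = payload;
                this.signalId = signalId;
            }
        }
    }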

I've been reading up about Twitter's Storm system and I'm quite interested in trying it out to analyse this mass of log data in realtime, and piece it together.

I'd like to do things like:

  • Produce reports and streaming graphs based on trends from the data in realtime.
  • Query a signal, then bring up the entire chain of events related to that signal in all applications, including delays between steps in the chain. (This is important).
  • View correlated events, and drill into what else an application was doing around the time of a certain event.

Getting the data in?

The log data is stored in local logfiles (and this is unlikely to change), so we'd need a way to slurp the data into Storm itself. Logfiles may also be compressed. I've thought about using Flume or Logstash - what are people's thoughts on these? Or are there alternative ways that would work well with Storm?
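
By "slurp", I mean something like the naive sketch below. It ignores tailing, rotation and delivery guarantees, which is exactly why I'm looking at a proper shipper rather than rolling our own.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.concurrent.BlockingQueue;
    import java.util.zip.GZIPInputStream;

    public class LogFileReader {

        // Read a (possibly gzip-compressed) logfile line by line and hand the lines
        // to a queue that a shipper (or a Storm spout) could drain.
        public static void readInto(String path, BlockingQueue<String> queue)
                throws IOException, InterruptedException {
            InputStream in = new FileInputStream(path);
            if (path.endsWith(".gz")) {
                in = new GZIPInputStream(in);
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line);
                }
            } finally {
                reader.close();
            }
        }
    }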

Storing events?

I also need a way to store both the data for the live reports and graphs, and the event data itself.

It's the second part I'm finding a bit tricky - what sort of storage backends are suitable for storing events, as well as the links between them? Would some kind of graph database be suitable, one of those new-fangled schemaless NoSQL ones, or something a bit more traditional?

Is Storm suitable?

Finally, is Storm suitable for this role, or is something else a better fit?

And if I do go with Storm, what sort of approach can I take to tackle this? I'm hoping other people have experience with similar problem sets.

Cheers, Victor

asked Feb 06 '13 by victorhooi



2 Answers

Produce reports and streaming graphs based on trends from the data in realtime

This one sounds like an excellent fit.

Query a signal, then bring up the entire chain of events related to that signal in all applications, including delays between steps in the chain. (This is important).

If your query is confined to recent data (i.e. not a lot of data) and you can tolerate data loss, I can imagine doing this using only Storm. If not, I might combine Storm with a database and use Storm mainly for preprocessing and storing the data into the database. The query is probably better handled by the database in that case.

View correlated events, and drill into what else an application was doing around the time of a certain event.

Storm is great when you know what query you'll be performing, and you don't need access to a lot of data for the queries. For example, serving up a feed that shows correlated events would be a great fit. Providing the means to perform ad-hoc queries (drill down) would probably be easier with a database. Also, if you want to allow a user to query a large amount of data (e.g. a week's worth of data as opposed to an hour's worth, etc.), then you will probably need a database.

As for feeding the data in, I would use a log centralisation product. You can create a spout that interacts with whatever interface that product provides. Alternatively, if you are using a logging framework that allows sending logs via sockets, JMS, etc. (like log4j), you could have a spout reading from that socket/JMS queue, along the lines of the sketch below.
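
A bare-bones version of such a spout might look like this. The BlockingQueue here is just a stand-in for whatever your log shipper feeds; in a real topology the spout would typically open its own socket/JMS connection in open(), since spouts are serialized out to the workers.

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class LogLineSpout extends BaseRichSpout {

        // Stand-in for the socket/JMS/queue connection that the log shipper feeds.
        private final BlockingQueue<String> queue;
        private SpoutOutputCollector collector;

        public LogLineSpout(BlockingQueue<String> queue) {
            this.queue = queue;
        }

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            // In a real deployment, connect to the socket/JMS queue here.
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            try {
                // Poll with a short timeout rather than block, so Storm's calling thread isn't starved.
                String line = queue.poll(10, TimeUnit.MILLISECONDS);
                if (line != null) {
                    collector.emit(new Values(line));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("logline"));
        }
    }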

As for DB choices, it really depends on what you want to do. If you don't know what kind of activity you'll be logging and want to correlate events, my bet would be on a graph database, as traversing events would be easy.

answered Oct 31 '22 by Enno Shioji


This sounds a lot like the case I'm working on at the moment, so I'll give a few ideas of what would be possible to do.

For getting the data in, you can take a look at Apache Kafka. This messaging system can get your logs off the applications and into intermediary storage. From there, different systems can attach as consumers; Storm is one of them, and it integrates well via the storm-kafka spout.
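
As a rough sketch of the wiring (the topic name, ZooKeeper address and the printing bolt are placeholders, and this assumes the storm-kafka integration module), a topology reading log lines from Kafka could look like this:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;

    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class LogTopology {

        // Stand-in bolt: just prints each log line. Real bolts would parse,
        // correlate and persist the events.
        public static class PrinterBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println(tuple.getString(0));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt, no output stream
            }
        }

        public static void main(String[] args) throws Exception {
            // Topic name, ZooKeeper address and consumer id are placeholders.
            SpoutConfig spoutConfig =
                    new SpoutConfig(new ZkHosts("zookeeper1:2181"), "app-logs", "/app-logs", "log-reader");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-logs", new KafkaSpout(spoutConfig), 4);
            builder.setBolt("print", new PrinterBolt(), 8).shuffleGrouping("kafka-logs");

            // Local cluster for trying it out; StormSubmitter would be used on a real cluster.
            new LocalCluster().submitTopology("log-processing", new Config(), builder.createTopology());
        }
    }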

In our case we have some real-time data consumed directly from the Kafka brokers into monitoring/dashboards, and other data streams that need processing through Storm. The latter are stored in a distributed DB (MongoDB, Cassandra or Couchbase), depending on the nature of the data, which is then loaded into dashboards and other systems.

For batch jobs, you can also load data from Kafka into Hadoop, and all of this can be done independently, pulling the same data from Kafka into multiple systems.

Kafka also supports multiple data centers through MirrorMaker.

answered Oct 31 '22 by Lundahl