Filtering logs with regex in Java

The description is quite long, so please bear with me:
I have log files ranging from 300 MB to 1.5 GB in size, which need to be filtered given a search key.

The format of the logs is something like this:

24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,828 [INFO] 567890 (Blah : Blah1) Service-name:: Content( May span multiple lines)
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[ ID1=fac-adasd ID2=123231
ID3=123108 Status=Unknown
Code=530007 Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
4 May 2017 17:00:06,831 [INFO] 567890 (Blah : Blah2) Service-name:: Content( May span multiple lines)

Given the search key 123456, I need to fetch the following:

24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[ ID1=fac-adasd ID2=123231
ID3=123108 Status=Unknown
Code=530007 Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content

The following awk script gets my job done (very slowly):

gawk '/([0-9]{1}|[0-9]{2})\s\w+\s[0-9]{4}/{n=0}/123456/{n=1} n'

It takes around 8 minutes to search a log file of 1 GB in size, and I need to do this for many such files. To top it off, I have multiple such search keys, which makes the whole task kind of impossible.

My initial solution was to use multithreading. I used a fixed thread pool executor and submitted a task for each file that needs to be filtered. Inside each task, I spawn a new process using Java's Runtime, which executes the gawk script through bash and writes the output to a file; the output files are then merged.

Although that might seem like a poor approach, since the filtering is I/O-bound rather than CPU-bound, it did give me a speedup compared to running the script on each file sequentially.
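
Roughly, the task submission looks like this (a trimmed sketch of what I described above; the file names and thread count are placeholders, and the gawk command is the one shown earlier):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelGawkFilter {
    public static void main(String[] args) throws Exception {
        List<String> logFiles = List.of("app1.log", "app2.log"); // placeholder paths
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (String logFile : logFiles) {
            pool.submit(() -> {
                // One gawk process per file; output is redirected to <file>.filtered
                String cmd = "gawk '/([0-9]{1}|[0-9]{2})\\s\\w+\\s[0-9]{4}/{n=0}/123456/{n=1} n' "
                        + logFile + " > " + logFile + ".filtered";
                Process p = Runtime.getRuntime().exec(new String[]{"bash", "-c", cmd});
                p.waitFor();
                return null;
            });
        }
        pool.shutdown();
        // after the pool terminates, the *.filtered files are concatenated into one result
    }
}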

But it still isn't sufficient, as the whole thing takes about 2 hours for a single search key across 27 GB of log files. On average, I have 4 such search keys and need to fetch all of their results and put them together.

My method isn't efficient because:

A) It accesses each log file multiple times when multiple search keys are given and causes even more I/O overhead.
B) It incurs the overhead of creating a process inside each thread.

A simpler solution to all of this is to move away from awk and do the whole thing in Java, using some regex library. The question here is: which regex library could give me the desired output?
With awk I have the /filter/{action} construct, which lets me capture a range of multiple lines (as seen above). How can I do the same in Java?
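
To make the question concrete, here is the kind of range capture I'm after, sketched with plain java.util.regex and a BufferedReader: buffer each multi-line record and print it if it contains the key. The file name and key are placeholders, and I don't know whether this is the fastest way, which is really what I'm asking:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class LogKeyFilter {
    // A new record starts with a "24 May 2017 ..." style timestamp
    private static final Pattern RECORD_START =
            Pattern.compile("^\\d{1,2}\\s\\w+\\s\\d{4}\\s");

    public static void main(String[] args) throws IOException {
        String key = "123456";                // placeholder search key
        StringBuilder record = new StringBuilder();

        try (BufferedReader reader = new BufferedReader(new FileReader("app.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (RECORD_START.matcher(line).find()) {
                    // Previous record is complete: emit it if it contains the key
                    if (record.indexOf(key) >= 0) {
                        System.out.print(record);
                    }
                    record.setLength(0);
                }
                record.append(line).append('\n');
            }
            if (record.indexOf(key) >= 0) {   // don't forget the last record
                System.out.print(record);
            }
        }
    }
}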

I'm open to all kinds of suggestions. For example, an extreme option would be to store the log files in shared storage such as S3 and process them using multiple machines.

I'm new to Stack Overflow and I don't even know if I can post this here, but I've been working on this for the past week and I need someone with expertise to guide me on it. Thanks in advance.

asked Jun 21 '17 08:06 by gitmorty


1 Answer

You have a few options.

The best one, in my opinion, would be to use an inverted index: for each keyword x present in at least one of the logs, you store a reference to every log that contains it. But since you have already spent a week on this task, I'd advise using something that already exists and does exactly that: Elasticsearch. You can actually use the full ELK stack (Elasticsearch, Logstash, Kibana; designed mainly for logs) and even have it parse the logs, since you can just put a regex expression in the config file. You only need to index the files once, and then searches come back in a few milliseconds.
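
To illustrate the inverted-index idea itself (a toy in-memory sketch only, not Elasticsearch's API; the word-boundary tokenization is a simplification):

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {
    // keyword -> set of log file names that contain it
    private final Map<String, Set<String>> index = new HashMap<>();

    // Called once per line while indexing; afterwards lookups are cheap
    public void add(String logFile, String line) {
        for (String token : line.split("\\W+")) {
            if (!token.isEmpty()) {
                index.computeIfAbsent(token, k -> new HashSet<>()).add(logFile);
            }
        }
    }

    public Set<String> filesContaining(String key) {
        return index.getOrDefault(key, Collections.emptySet());
    }
}

The point is that the expensive pass over the 27 GB happens once, at index time; each of your 4 search keys then becomes a lookup instead of another full scan. Elasticsearch does essentially the same thing, but persisted, distributed, and with the parsing handled for you.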

If you really want to waste energy and not go for the best solution, you can use MapReduce on Hadoop to filter the logs. But that's not a task MapReduce is optimal for, and it would be more of a hack.

answered Oct 21 '22 22:10 by Dinu Sorin