I have a log file like the one below:
Begin ... 12-07-2008 02:00:05 ----> record1
incidentID: inc001
description: blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 02:00:05
Begin ... 12-07-2008 03:00:05 ----> record2
incidentID: inc002
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc
status: resolved
end .... 13-07-2008 03:00:05
I want to process this with MapReduce and extract the incident ID, the status, and the time taken for each incident.
How do I handle records of variable length, and what happens if an input split falls before the end of a record?
You'll need to write your own input format and record reader to ensure proper file splitting around your record delimiter.
Basically, your record reader will need to seek to its split byte offset and scan forward (reading lines) until it finds a Begin ... line. It then keeps reading up to the next end ... line and provides the lines between the begin and end as the input for the next record.
This is similar in algorithm to how Mahout's XMLInputFormat handles multi-line XML as input; in fact, you might be able to amend that source code directly to handle your situation.
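The scanning logic described above can be sketched outside Hadoop. The following is a minimal Python illustration of the algorithm only, not Hadoop's RecordReader API: the function name read_records, the SAMPLE data, and the rule that a split owns every record whose Begin line starts inside its byte range are all assumptions made for the example. It also assumes that no field line starts with "Begin" or "end".

```python
import io

# Hypothetical sample data in the same layout as the log in the question.
SAMPLE = (
    b"Begin ... 12-07-2008 02:00:05 ----> record1\n"
    b"incidentID: inc001\n"
    b"description: blah blah blah\n"
    b"owner: abc\n"
    b"status: resolved\n"
    b"end .... 13-07-2008 02:00:05\n"
    b"Begin ... 12-07-2008 03:00:05 ----> record2\n"
    b"incidentID: inc002\n"
    b"description: blah blah blahblahblah\n"
    b"owner: abc\n"
    b"status: resolved\n"
    b"end .... 13-07-2008 03:00:05\n"
)


def read_records(stream, start, end):
    """Yield each record (a list of lines) whose Begin line starts in [start, end)."""
    if start > 0:
        # Back up one byte and discard through the next newline: a line that
        # starts exactly at `start` is kept, a partial line is skipped.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line or pos >= end:
            return  # end of file, or this line belongs to the next split
        if not line.startswith(b"Begin"):
            continue  # tail of a record owned by the previous split
        record = [line.decode().rstrip("\n")]
        while True:
            line = stream.readline()  # may read past `end` to finish the record
            if not line:
                break  # truncated record at end of file
            record.append(line.decode().rstrip("\n"))
            if line.startswith(b"end"):
                break
        yield record


# Usage: cut the sample at an arbitrary byte offset and read each half.
cut = 100
first_split = list(read_records(io.BytesIO(SAMPLE), 0, cut))
second_split = list(read_records(io.BytesIO(SAMPLE), cut, len(SAMPLE)))
```

Because a reader finishes any record it has started, even past the end of its split, and skips lines until the first Begin line inside its own range, every record should be emitted by exactly one split no matter where the cut falls.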
As mentioned in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines per record, but it is really inefficient for larger files, as it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.
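Whichever input format you pick, once the mapper receives a whole record as a block of lines, extracting the incident ID, the status, and the time taken is straightforward. Here is a rough Python sketch of that step; the function name parse_record is hypothetical, and it assumes the timestamps are in day-month-year order and that every field line has the form "key: value".

```python
from datetime import datetime

TS_FORMAT = "%d-%m-%Y %H:%M:%S"  # assumed day-month-year order


def parse_record(lines):
    """Return (incident_id, status, time_taken) for one Begin...end record."""
    # The third and fourth whitespace-separated tokens of the first and last
    # lines hold the date and the time.
    begin = datetime.strptime(" ".join(lines[0].split()[2:4]), TS_FORMAT)
    end = datetime.strptime(" ".join(lines[-1].split()[2:4]), TS_FORMAT)
    fields = {}
    for line in lines[1:-1]:
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields["incidentID"], fields["status"], end - begin


# Usage with the first record from the question.
record = [
    "Begin ... 12-07-2008 02:00:05 ----> record1",
    "incidentID: inc001",
    "description: blah blah blah",
    "owner: abc",
    "status: resolved",
    "end .... 13-07-2008 02:00:05",
]
incident_id, status, time_taken = parse_record(record)
```

Here time_taken is a datetime.timedelta, so the mapper could emit the incident ID as the key and the status plus time_taken.total_seconds() as the value.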