
How to read a record that spans multiple lines, and how to handle records broken across an input split

I have a log file like the one below:

Begin ... 12-07-2008 02:00:05         ----> record1
incidentID: inc001
description: blah blah blah 
owner: abc 
status: resolved 
end .... 13-07-2008 02:00:05 
Begin ... 12-07-2008 03:00:05         ----> record2 
incidentID: inc002 
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc 
status: resolved 
end .... 13-07-2008 03:00:05

I want to process this with MapReduce and extract the incident ID, the status, and the time taken for each incident.

How can I handle records like these two, which vary in length, and what happens if an input split falls before a record ends?

ghosts asked Jul 18 '13 02:07


1 Answer

You'll need to write your own input format and record reader to ensure proper file splitting around your record delimiter.

Basically, your record reader will need to seek to its split's byte offset and scan forward (reading lines) until it either:

  • finds a Begin ... line
    • in which case it reads lines up to the next end ... line and provides the lines between Begin and end as the input for the next record
  • scans past the end of the split or reaches EOF
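
The scan described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the names SplitRecordScanSketch and readSplit are made up, the file is held in a byte array rather than read from HDFS, and a record is assumed to start on a line beginning with Begin and to finish on a line beginning with end. The rule it applies is that a record belongs to the split in which its Begin line starts, so the reader may run past the split end to finish a record it has already opened.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the split-aware scan; hypothetical, not a Hadoop RecordReader. */
final class SplitRecordScanSketch {

    /** Returns every complete record whose Begin line starts inside [splitStart, splitEnd). */
    static List<String> readSplit(byte[] file, long splitStart, long splitEnd) {
        List<String> records = new ArrayList<>();
        int pos = (int) splitStart;

        // If the split starts mid-line, that line belongs to the previous split:
        // skip forward to the start of the next line.
        if (pos > 0 && file[pos - 1] != '\n') {
            pos = nextLineStart(file, pos);
        }

        StringBuilder current = null; // lines of the record being assembled, if any
        while (pos < file.length) {
            int lineEnd = nextLineStart(file, pos);
            String line = new String(file, pos, lineEnd - pos, StandardCharsets.UTF_8).trim();

            if (line.startsWith("Begin")) {
                // A Begin at or past the split end is the next split's job.
                if (pos >= splitEnd) {
                    break;
                }
                // A Begin while a record is still open means the open record
                // never saw its end line; this sketch drops that broken record.
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append('\n').append(line);
                if (line.startsWith("end")) {
                    records.add(current.toString());
                    current = null;
                }
            } else if (pos >= splitEnd) {
                break; // nothing open and past the split end: done
            }
            // Otherwise this line is the tail of a record owned by the previous split.
            pos = lineEnd;
        }
        // A record still open at EOF has no end line and is dropped as broken.
        return records;
    }

    /** Index of the first byte after the next '\n' at or after 'from', or file.length. */
    private static int nextLineStart(byte[] file, int from) {
        int i = from;
        while (i < file.length && file[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, file.length);
    }
}
```

With this rule, wherever the boundary between two adjacent splits falls, each complete record is emitted by exactly one of them. In a real job the same logic would sit inside a custom RecordReader, with the byte array replaced by a stream opened at the split offset.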

This is similar in algorithm to how Mahout's XmlInputFormat handles multi-line XML input; in fact, you might be able to adapt its source code directly to your situation.

As mentioned in @irW's answer, NLineInputFormat is another option if every record has the same fixed number of lines, but it is really inefficient for larger files because it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.
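
Once a reader hands over a whole record as a single value, pulling out the fields the question asks for is ordinary string handling. The sketch below is one possible way to do it, not a fixed recipe: IncidentRecordSketch, Incident and parse are hypothetical names, and the timestamp is assumed to be day-month-year (the sample's 13-07-2008 rules out month-first).

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts incident ID, status and elapsed time from one record. */
final class IncidentRecordSketch {

    /** Simple value holder for the extracted fields. */
    static final class Incident {
        final String id;
        final String status;
        final Duration timeTaken;

        Incident(String id, String status, Duration timeTaken) {
            this.id = id;
            this.status = status;
            this.timeTaken = timeTaken;
        }
    }

    // Assumption: timestamps look like 12-07-2008 02:00:05 (day-month-year).
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}");
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("dd-MM-uuuu HH:mm:ss");

    static Incident parse(String record) {
        Map<String, String> fields = new HashMap<>();
        LocalDateTime begin = null;
        LocalDateTime end = null;

        for (String raw : record.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("Begin")) {
                begin = timestampIn(line);
            } else if (line.startsWith("end")) {
                end = timestampIn(line);
            } else {
                // "key: value" lines such as "incidentID: inc001"
                int colon = line.indexOf(':');
                if (colon > 0) {
                    fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
        }
        if (begin == null || end == null) {
            throw new IllegalArgumentException("record has no usable Begin/end timestamps");
        }
        return new Incident(fields.get("incidentID"), fields.get("status"),
                Duration.between(begin, end));
    }

    /** Returns the first timestamp found in the line, or null if there is none. */
    private static LocalDateTime timestampIn(String line) {
        Matcher m = TIMESTAMP.matcher(line);
        return m.find() ? LocalDateTime.parse(m.group(), FORMAT) : null;
    }
}
```

In a mapper, the incident ID could serve as the output key and the status plus the duration as the value; a record that throws here is a candidate for a counter rather than a job failure.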

Chris White answered Sep 26 '22 10:09