
Parsing a multi-line data file with Python [closed]

Tags:

python

What's the best way to go about parsing the following multi-line data file with Python?

Police Response: 11/6/2012 1:34:06 AM   Incident Desc: Traffic Stop OFC:    Received: 11/6/2012 1:34:06 AM
Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:
Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM
Disp: CSR   Location: SCH BLACHLY
Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM   Incident Desc: Suspicious Vehicle(s)    OFC:        Received: 11/6/2012 1:47:47 AM
Disp: FI    Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944   ID: 60944   Priority: 6 Case No:

Records are always broken up into 3 lines: the first begins with "Police Response" and the last begins with "Event Number". Some fields are often blank.

Jake asked Nov 08 '12 18:11


2 Answers

This should do the trick. I split your data into a list of cases, each containing three lines of your data. Then I use a regular expression split to break each case apart at the field names. After that I put the key-value pairs into an ordered dictionary, so it's easy for you to loop through the cases and look up any field value by name. I print out the contents of rows just to show the data structure.

code

from pprint import pprint
from collections import OrderedDict
import re

data = """Police Response: 11/6/2012 1:34:06 AM   Incident Desc: Traffic Stop OFC:    Received: 11/6/2012 1:34:06 AM
Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:
Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM
Disp: CSR   Location: SCH BLACHLY
Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM   Incident Desc: Suspicious Vehicle(s)    OFC:        Received: 11/6/2012 1:47:47 AM
Disp: FI    Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944   ID: 60944   Priority: 6 Case No: """

lines = data.splitlines()
cases = ['\n'.join(lines[i:i+3]) for i in range(0, len(lines), 3)]
pattern = '(Police Response|Incident Desc|OFC|Received|Disp|Location|Event Number|ID|Priority|Case No):'
rows = []
for case in cases:
    # re.split keeps the captured field names, so pairs alternates name, value, ...
    pairs = re.split(pattern, case)[1:]
    rows.append(OrderedDict(zip(pairs[::2], pairs[1::2])))

for i, row in enumerate(rows):
    print('============== {} =============='.format(i))
    pprint(list(row.items()))

output:

============== 0 ==============
[('Police Response', ' 11/6/2012 1:34:06 AM   '),
 ('Incident Desc', ' Traffic Stop '),
 ('OFC', '    '),
 ('Received', ' 11/6/2012 1:34:06 AM\n'),
 ('Disp', ' PCHK  '),
 ('Location', ' CLEAR LAKE RD&GREEN HILL RD\n'),
 ('Event Number', ' LLS121106060941   '),
 ('ID', ' 60941   '),
 ('Priority', ' 6 '),
 ('Case No', '')]
============== 1 ==============
[('Police Response', '    '),
 ('Incident Desc', ' Theft    '),
 ('OFC', '    '),
 ('Received', ' 11/6/2012 1:43:35 AM\n'),
 ('Disp', ' CSR   '),
 ('Location', ' SCH BLACHLY\n'),
 ('Event Number', ' LLS121106060943   '),
 ('ID', ' 60943   '),
 ('Priority', ' 4 '),
 ('Case No', '')]
============== 2 ==============
[('Police Response', ' 11/6/2012 1:47:47 AM   '),
 ('Incident Desc', ' Suspicious Vehicle(s)    '),
 ('OFC', '        '),
 ('Received', ' 11/6/2012 1:47:47 AM\n'),
 ('Disp', ' FI    '),
 ('Location', ' KIRK RD&CLEAR LAKE RD\n'),
 ('Event Number', ' LLS121106060944   '),
 ('ID', ' 60944   '),
 ('Priority', ' 6 '),
 ('Case No', ' ')]
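
The values in the output above still carry their padding and trailing newlines. As a minimal sketch of one way to clean them up, the variant below strips each value while building the dictionary. The names `FIELDS`, `PATTERN` and `parse_case` are made up for illustration, and the record is copied from the question.

```python
from collections import OrderedDict
import re

# Same field names as in the pattern above.
FIELDS = ('Police Response|Incident Desc|OFC|Received|Disp|Location|'
          'Event Number|ID|Priority|Case No')
PATTERN = re.compile('(' + FIELDS + '):')

def parse_case(case):
    """Split one three-line record into an OrderedDict of stripped values."""
    parts = PATTERN.split(case)[1:]  # drop the empty text before the first field
    # parts alternates name, value, name, value, ...
    return OrderedDict((name, value.strip())
                       for name, value in zip(parts[::2], parts[1::2]))

case = """Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM
Disp: CSR   Location: SCH BLACHLY
Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:"""

row = parse_case(case)
print(row['Incident Desc'])          # Theft
print(repr(row['Police Response']))  # '' (a blank field becomes an empty string)
```

Blank fields come out as empty strings, which makes them easy to test with a plain `if row['OFC']:`.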
Marwan Alsabbagh answered Oct 14 '22 22:10


The big question:

What is used to delimit the entries? If there are tabs between entries, that makes it easy: just split each line by tab. If there are always at least two spaces, you can split by that. If there's sometimes just one space, that complicates things.
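
To make the delimiter question concrete, here is a rough sketch that splits on a tab or on a run of two or more spaces. It assumes the separators in the real file look like the ones in the pasted sample, which may not hold; `split_entries` is a hypothetical helper name.

```python
import re

# Two lines from the question's sample, with runs of spaces as separators.
line = "Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD"
tricky = "Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:"

def split_entries(text):
    """Split on a tab or on a run of two or more spaces, dropping empty parts."""
    return [part for part in re.split(r'\t| {2,}', text.strip()) if part]

print(split_entries(line))
# ['Disp: PCHK', 'Location: CLEAR LAKE RD&GREEN HILL RD']
print(split_entries(tricky))
# ['Event Number: LLS121106060941', 'ID: 60941', 'Priority: 6 Case No:']
```

The second result shows the complication: "Priority: 6 Case No:" stays glued together because only one space separates the two fields.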

Otherwise, it's easy to make a generator/function to spit out three lines at a time, which you can then throw into a function that parses the three lines. The '3-lines-at-a-time' part of your problem is the easy part.

def return_3(f):
    # Read the next three lines from an open file (or any iterator of lines).
    return [next(f) for _ in range(3)]
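
The function above hands back one group of three lines per call. A generator version could look like the sketch below, which uses `itertools.islice` to pull lines in groups until the input runs out. The names `records` and `size` are invented for this example, and `io.StringIO` stands in for a real open file.

```python
from io import StringIO
from itertools import islice

def records(lines, size=3):
    """Yield successive lists of `size` lines; a short final group is yielded as-is."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # input exhausted
        yield chunk

# StringIO behaves like an open text file: iterating over it yields lines.
fake_file = StringIO("a1\na2\na3\nb1\nb2\nb3\n")
for record in records(fake_file):
    print([line.rstrip("\n") for line in record])
# ['a1', 'a2', 'a3']
# ['b1', 'b2', 'b3']
```

Each yielded list can then be passed to whatever function parses a single three-line record.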
kreativitea answered Oct 14 '22 20:10