What's the best way to go about parsing the following multi-line data file with Python?
Police Response: 11/6/2012 1:34:06 AM Incident Desc: Traffic Stop OFC: Received: 11/6/2012 1:34:06 AM
Disp: PCHK Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941 ID: 60941 Priority: 6 Case No:
Police Response: Incident Desc: Theft OFC: Received: 11/6/2012 1:43:35 AM
Disp: CSR Location: SCH BLACHLY
Event Number: LLS121106060943 ID: 60943 Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM Incident Desc: Suspicious Vehicle(s) OFC: Received: 11/6/2012 1:47:47 AM
Disp: FI Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944 ID: 60944 Priority: 6 Case No:
Records are always broken up into 3 lines: each record starts with a line beginning with "Police Response" and ends with a line beginning with "Event Number". Some fields are often blank.
This should do the trick. I split your data into a list of cases, each containing three lines of your data. Then I used regular-expression splitting to split each case by the field names. After that I put the key-value pairs into a dictionary (an OrderedDict, so the fields keep their order), which makes it easy to loop through the cases and access any field value by name. I print out the contents of rows just to show the data structure.
code
from pprint import pprint
from collections import OrderedDict
import re
data = """Police Response: 11/6/2012 1:34:06 AM Incident Desc: Traffic Stop OFC: Received: 11/6/2012 1:34:06 AM
Disp: PCHK Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941 ID: 60941 Priority: 6 Case No:
Police Response: Incident Desc: Theft OFC: Received: 11/6/2012 1:43:35 AM
Disp: CSR Location: SCH BLACHLY
Event Number: LLS121106060943 ID: 60943 Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM Incident Desc: Suspicious Vehicle(s) OFC: Received: 11/6/2012 1:47:47 AM
Disp: FI Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944 ID: 60944 Priority: 6 Case No: """
# Group the lines into records of three lines each.
lines = data.splitlines()
cases = ['\n'.join(lines[i:i+3]) for i in range(0, len(lines), 3)]
# Capturing group, so re.split keeps the field names in its result.
pattern = '(Police Response|Incident Desc|OFC|Received|Disp|Location|Event Number|ID|Priority|Case No):'
rows = []
for case in cases:
    # Drop the empty string before the first field name; the rest alternates name, value.
    pairs = re.split(pattern, case)[1:]
    rows.append(OrderedDict((pairs[i*2], pairs[i*2+1]) for i in range(10)))
for i, row in enumerate(rows):
    print('============== {} =============='.format(i))
    pprint(list(row.items()))
output:
============== 0 ==============
[('Police Response', ' 11/6/2012 1:34:06 AM '),
('Incident Desc', ' Traffic Stop '),
('OFC', ' '),
('Received', ' 11/6/2012 1:34:06 AM\n'),
('Disp', ' PCHK '),
('Location', ' CLEAR LAKE RD&GREEN HILL RD\n'),
('Event Number', ' LLS121106060941 '),
('ID', ' 60941 '),
('Priority', ' 6 '),
('Case No', '')]
============== 1 ==============
[('Police Response', ' '),
('Incident Desc', ' Theft '),
('OFC', ' '),
('Received', ' 11/6/2012 1:43:35 AM\n'),
('Disp', ' CSR '),
('Location', ' SCH BLACHLY\n'),
('Event Number', ' LLS121106060943 '),
('ID', ' 60943 '),
('Priority', ' 4 '),
('Case No', '')]
============== 2 ==============
[('Police Response', ' 11/6/2012 1:47:47 AM '),
('Incident Desc', ' Suspicious Vehicle(s) '),
('OFC', ' '),
('Received', ' 11/6/2012 1:47:47 AM\n'),
('Disp', ' FI '),
('Location', ' KIRK RD&CLEAR LAKE RD\n'),
('Event Number', ' LLS121106060944 '),
('ID', ' 60944 '),
('Priority', ' 6 '),
('Case No', ' ')]
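The values above still carry the spaces and newlines that surrounded them in the file. Here is a minimal sketch of one way to tidy them, assuming blank fields should become empty strings; `clean_row` is a hypothetical helper name, not part of the code above:

```python
from collections import OrderedDict


def clean_row(row):
    """Return a copy of row with surrounding whitespace stripped from each value."""
    return OrderedDict((key, value.strip()) for key, value in row.items())


# Example using two of the fields shown in the output above.
sample = OrderedDict([('Received', ' 11/6/2012 1:34:06 AM\n'), ('OFC', ' ')])
cleaned = clean_row(sample)
print(cleaned['Received'])
```

Applied to each entry of rows, this would leave the field order unchanged and turn blank fields such as OFC into ''.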
The big question:
What is used to delimit the entries? If there are tabs between entries, that makes it easy: just split each line by tab. If there are always at least two spaces, you can split by that. If there's sometimes just one space, that complicates things.
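As a rough sketch of the easy cases, assuming entries really are separated by a tab or by two or more spaces (the sample line is made up to fit that assumption, and `split_entries` is a hypothetical name):

```python
import re


def split_entries(line):
    """Split a line on tabs or on runs of two or more whitespace characters."""
    return [entry for entry in re.split(r'\t|\s{2,}', line.strip()) if entry]


# Hypothetical line where the two entries are separated by two spaces.
line = 'Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD'
print(split_entries(line))
```

Single spaces inside a value, such as those in the location, are left alone; only tabs and wider gaps count as delimiters.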
Otherwise, it's easy to make a generator/function to spit out three lines at a time, which you can then throw into a function that parses the three lines. The '3-lines-at-a-time' part of your problem is the easy part.
def return_3(file):
    # Read the next three lines (one record) from an open file or other iterator.
    return [next(file) for i in range(3)]
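The generator variant mentioned above could look something like this. It is only a sketch: `records` and its `size` parameter are hypothetical names, and `io.StringIO` stands in for an open file.

```python
import io
from itertools import islice


def records(lines, size=3):
    """Yield lists of up to `size` consecutive lines until the input runs out.

    A trailing partial record is yielded as-is rather than dropped.
    """
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, size))
        if not record:
            return
        yield record


# io.StringIO stands in for an open file here.
sample = io.StringIO(u'line 1\nline 2\nline 3\nline 4\nline 5\nline 6\n')
for record in records(sample):
    print(record)
```

Each yielded list can then be handed to whatever function parses one three-line record.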