Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing unstructured text in Python

I wanted to parse a text file that contains unstructured text. I need to get the address, date of birth, name, sex, and ID.

. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12    
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13

In the example above, the first 3 lines is the address, the line with just an "F" is the sex, the DOB would be the line after "F", name after the DOB, the ID after the name, and the no. 12 under the ID is the index/record no.

However, the format is not consistent. In the second group, the address is 4 lines instead of 3 and the index/record no. is appended after the name (if the person doesn't have an ID field).

I wanted to rewrite the text into the following format:

name, ID, address, sex, DOB
like image 219
FrancisV Avatar asked Nov 27 '22 23:11

FrancisV


2 Answers

Here is a first stab at a pyparsing solution (easy-to-copy code at the pyparsing pastebin). Walk through the separate parts, according to the interleaved comments.

data = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""

from pyparsing import LineEnd, oneOf, Word, nums, Combine, restOfLine, \
    alphanums, Suppress, empty, originalTextFor, OneOrMore, alphas, \
    Group, ZeroOrMore

NL = LineEnd().suppress()
gender = oneOf("M F")
integer = Word(nums)
date = Combine(integer + '/' + integer + '/' + integer)

# define the simple line definitions
gender_line = gender("sex") + NL
dob_line = date("DOB") + NL
name_line = restOfLine("name") + NL
id_line = Word(alphanums+"-")("ID") + NL
recnum_line = integer("recnum") + NL

# define forms of address lines
first_addr_line = Suppress('.') + empty + restOfLine + NL
# a subsequent address line is any line that is not a gender definition
subsq_addr_line = ~(gender_line) + restOfLine + NL

# a line with a name and a recnum combined, if there is no ID
name_recnum_line = originalTextFor(OneOrMore(Word(alphas+',')))("name") + \
    integer("recnum") + NL

# defining the form of an overall record, either with or without an ID
record = Group((first_addr_line + ZeroOrMore(subsq_addr_line))("address") + 
    gender_line + 
    dob_line +
    ((name_line +
        id_line + 
        recnum_line) |
      name_recnum_line))

# parse data
records = OneOrMore(record).parseString(data)

# output the desired results (note that address is actually a list of lines)
for rec in records:
    if rec.ID:
        print "%(name)s, %(ID)s, %(address)s, %(sex)s, %(DOB)s" % rec
    else:
        print "%(name)s, , %(address)s, %(sex)s, %(DOB)s" % rec
print

# how to access the individual fields of the parsed record
for rec in records:
    print rec.dump()
    print rec.name, 'is', rec.sex
    print

Prints:

ALOMO, TERESITA CABALLES, 3412-00000-A1652TCA2, ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], F, 01/16/1952
AMURAO, CALIXTO MANALO, , ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], M, 10/14/1967

['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'F', '01/16/1952', 'ALOMO, TERESITA CABALLES', '3412-00000-A1652TCA2', '12']
- DOB: 01/16/1952
- ID: 3412-00000-A1652TCA2
- address: ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: ALOMO, TERESITA CABALLES
- recnum: 12
- sex: F
ALOMO, TERESITA CABALLES is F

['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'M', '10/14/1967', 'AMURAO, CALIXTO MANALO', '13']
- DOB: 10/14/1967
- address: ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: AMURAO, CALIXTO MANALO
- recnum: 13
- sex: M
AMURAO, CALIXTO MANALO is M
like image 168
PaulMcG Avatar answered Dec 05 '22 16:12

PaulMcG


you have to exploit whatever regularity and structure the text does have.

I suggest you read one line at a time and match it to a regular expression to determine its type, fill in the appropriate field in a person object. writing out that object and starting a new one whenever you get a field that you already have filled in.

like image 20
Nathan Avatar answered Dec 05 '22 15:12

Nathan