How to efficiently parse fixed width files?

1 Answers

Using the Python standard library's struct module would be fairly easy as well as extremely fast since it's written in C.

Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.

import struct  fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')                         for fw in fieldwidths) fieldstruct = struct.Struct(fmtstring) parse = fieldstruct.unpack_from print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))  line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n' fields = parse(line) print('fields: {}'.format(fields))

Output:

fmtstring: '2s 10x 24s', recsize: 36 chars fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

The following modifications would adapt it work in Python 2 or 3 (and handle Unicode input):

import struct import sys  fieldstruct = struct.Struct(fmtstring) if sys.version_info[0] < 3:     parse = fieldstruct.unpack_from else:     # converts unicode input to byte string and results back to unicode string     unpack = fieldstruct.unpack_from     parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

Here's a way to do it with string slices, as you were considering but were concerned that it might get too ugly. The nice thing about it is, besides not being all that ugly, is that it works unchanged in both Python 2 and 3, as well as being able to handle Unicode strings. Speed-wise it is, of course, slower than the versions based the struct module, but could be sped-up slightly by removing the ability to have padding fields.

try:     from itertools import izip_longest  # added in Py 2.6 except ImportError:     from itertools import zip_longest as izip_longest  # name change in Py 3.x  try:     from itertools import accumulate  # added in Py 3.2 except ImportError:     def accumulate(iterable):         'Return running totals (simplified version).'         total = next(iterable)         yield total         for value in iterable:             total += value             yield total  def make_parser(fieldwidths):     cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))     pads = tuple(fw < 0 for fw in fieldwidths) # bool values for padding fields     flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one     parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)     # optional informational function attributes     parse.size = sum(abs(fw) for fw in fieldwidths)     parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')                                                 for fw in fieldwidths)     return parse  line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n' fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields parse = make_parser(fieldwidths) fields = parse(line) print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size)) print('fields: {}'.format(fields))

Output:

format: '2s 10x 24s', rec size: 36 chars fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

192

answered Sep 29 '22 15:09

martineau

Related questions
                            
                                Element-wise logical OR in Pandas
                            
                                what shebang to use for python scripts run under a pyenv virtualenv
                            
                                Support for Enum arguments in argparse
                            
                                Using super with a class method
                            
                                When is not a good time to use python generators?
                            
                                Create an empty data frame with index from another data frame
                            
                                Python Pandas replicate rows in dataframe
                            
                                Save a large file using the Python requests library [duplicate]
                            
                                ImportError: No module named mock
                            
                                Are python variables pointers? or else what are they?
                            
                                Django - taking values from POST request
                            
                                Accessing JSON elements
                            
                                How to safely open/close files in python 2.4
                            
                                Calling IPython from a virtualenv
                            
                                Where does flask look for image files?
                            
                                Error: pg_config executable not found when installing psycopg2 on Alpine in Docker
                            
                                Why do I get a "referenced before assignment" error when assigning to a global variable in a function?
                            
                                How to pass arguments in pytest by command line
                            
                                How to implement conditional string formatting?
                            
                                How to terminate a thread when main program ends?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to efficiently parse fixed width files?

Tags:

python

parsing

hyperboreean

People also ask

1 Answers

martineau

Recent Activity

Donate For Us