
How to efficiently parse fixed width files?

Tags: python, parsing

I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21 to 30 another, and so on.

Assuming that the line holds 100 characters, what would be an efficient way to parse a line into several components?

I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?

hyperboreean asked Feb 06 '11 14:02



1 Answer

Using the Python standard library's struct module would be fairly easy as well as extremely fast since it's written in C.

Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.

    import struct

    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from
    print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

    line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
    # Python 2: str is a byte string, so it can be passed to unpack_from directly
    fields = parse(line)
    print('fields: {}'.format(fields))

Output:

    fmtstring: '2s 10x 24s', recsize: 36 chars
    fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

The following modifications would adapt it to work in both Python 2 and 3 (and handle Unicode input):

    import struct
    import sys

    fieldstruct = struct.Struct(fmtstring)
    if sys.version_info[0] < 3:
        parse = fieldstruct.unpack_from
    else:
        # converts unicode input to byte string and results back to unicode string
        unpack = fieldstruct.unpack_from
        parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
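For readers who only need Python 3, the pieces above can be combined into one self-contained sketch. The helper name `make_struct_parser`, the `encoding` parameter, and the second sample line are illustrative assumptions and not part of the original answer. Note that `struct` counts bytes, not characters, so this sketch assumes a single-byte encoding such as ASCII; with a multi-byte encoding, non-ASCII characters would shift the field offsets.

```python
import struct

# Hypothetical layout: 2-char field, 10 ignored chars, 24-char field.
FIELD_WIDTHS = (2, -10, 24)


def make_struct_parser(fieldwidths, encoding="ascii"):
    """Return a function that splits one fixed-width text line into fields."""
    fmt = " ".join(
        "{}{}".format(abs(fw), "x" if fw < 0 else "s") for fw in fieldwidths
    )
    unpack = struct.Struct(fmt).unpack_from

    def parse(line):
        # struct works on bytes, so encode the input and decode each field.
        return tuple(f.decode(encoding) for f in unpack(line.encode(encoding)))

    return parse


parse = make_struct_parser(FIELD_WIDTHS)
lines = [
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n",
    "abcdefghijklmnopqrstuvwxyz9876543210\n",
]
records = [parse(line) for line in lines]
print(records[0])  # ('AB', 'MNOPQRSTUVWXYZ0123456789')
```

Building the parser once and reusing it for every line keeps the format string from being recompiled per record.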

Here's a way to do it with string slices, which you were considering but feared might get too ugly. Besides not being all that ugly, it has the advantage of working unchanged in both Python 2 and 3, and it handles Unicode strings. Speed-wise it is, of course, slower than the versions based on the struct module, but it could be sped up slightly by removing the ability to have padding fields.

    try:
        from itertools import izip_longest  # added in Py 2.6
    except ImportError:
        from itertools import zip_longest as izip_longest  # name change in Py 3.x

    try:
        from itertools import accumulate  # added in Py 3.2
    except ImportError:
        def accumulate(iterable):
            'Return running totals (simplified version).'
            iterator = iter(iterable)  # accept any iterable, not only iterators
            total = next(iterator)
            yield total
            for value in iterator:
                total += value
                yield total

    def make_parser(fieldwidths):
        cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
        pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
        flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
        parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
        # optional informational function attributes
        parse.size = sum(abs(fw) for fw in fieldwidths)
        parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                   for fw in fieldwidths)
        return parse

    line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    parse = make_parser(fieldwidths)
    fields = parse(line)
    print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
    print('fields: {}'.format(fields))

Output:

    format: '2s 10x 24s', rec size: 36 chars
    fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
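The simplification mentioned above, dropping support for padding fields, might look roughly like the following Python 3 sketch. The name `make_slice_parser` is hypothetical, and the variant has not been benchmarked here, so any speed gain is an assumption. Every width is treated as a real field; a column to be ignored is parsed like any other and discarded by the caller.

```python
from itertools import accumulate  # Python 3.2+


def make_slice_parser(fieldwidths):
    """Build a parser from positive field widths only (no padding fields)."""
    ends = list(accumulate(fieldwidths))   # running totals: end of each field
    starts = [0] + ends[:-1]               # each field starts where the last ended
    slices = tuple(slice(i, j) for i, j in zip(starts, ends))
    return lambda line: tuple(line[s] for s in slices)


# To skip a column, parse it as a normal field and drop it afterwards.
parse = make_slice_parser((2, 10, 24))
code, _ignored, payload = parse("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n")
print(code, payload)  # AB MNOPQRSTUVWXYZ0123456789
```

Precomputing the `slice` objects once means the per-line work is only the slicing itself, with no padding test inside the loop.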
martineau answered Sep 29 '22 15:09