I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21 to 30 another one, and so on.
Assuming that the line holds 100 characters, what would be an efficient way to parse a line into several components?
I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?
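For concreteness, the slicing approach I mean looks roughly like this (a sketch; parse_line is just an illustrative name, and the boundaries are the ones from my example):

def parse_line(line):
    # one slice per column: characters 1-20, 21-30, 31-100 (0-based slices)
    return line[0:20], line[20:30], line[30:100]

line = 'x' * 100  # stand-in for one 100-character record
col_a, col_b, col_c = parse_line(line)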
If we want to read a fixed-width text file into R (or RStudio), we can use the read.fwf function. Within read.fwf, we have to specify the location of the file and the starting points of the data, i.e. from which line the data begins and at which positions new columns start.
You can convert a fixed-width file to a CSV using Python pandas by reading the fixed-width file into a DataFrame df with pd.read_fwf('my_file.fwf') and writing the DataFrame to a CSV with df.to_csv('my_file.csv').
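As a minimal sketch of that pandas round trip (the column widths below are placeholders for the actual layout of the file):

import pandas as pd

# widths lists the character width of each column; these three values
# are placeholders, not anything from the original question
df = pd.read_fwf('my_file.fwf', widths=[20, 10, 70])
df.to_csv('my_file.csv', index=False)  # index=False drops the row-number column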
Fixed-width is a file format where data is arranged in columns, but instead of the columns being delimited by a particular character (as they are in CSV), every row is exactly the same length and each column occupies the same character positions in every row.
Using the Python standard library's struct module would be fairly easy as well as extremely fast, since it's written in C.
Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.
import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', recsize: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
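To run this over a whole file, one record per line, something like the following sketch should work ('data.txt' is a placeholder; the file is opened in binary mode because unpack_from expects a byte string on Python 3):

import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
parse = struct.Struct(fmtstring).unpack_from

with open('data.txt', 'rb') as f:
    for line in f:
        print(parse(line))  # tuple of byte strings, one per kept field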
The following modifications would adapt it to work in Python 2 or 3 (and handle Unicode input):
import struct
import sys

fieldstruct = struct.Struct(fmtstring)
if sys.version_info[0] < 3:
    parse = fieldstruct.unpack_from
else:
    # converts unicode input to byte string and results back to unicode string
    unpack = fieldstruct.unpack_from
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
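With that change the call site stays the same, and the fields come back as (unicode) strings on both versions:

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)  # ('AB', 'MNOPQRSTUVWXYZ0123456789') under Py 2 and Py 3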
Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3 and can handle Unicode strings. Speed-wise it is, of course, slower than the version based on the struct module, but it could be sped up slightly by removing the ability to have padding fields (a sketch of that variant follows the output below).
try:
    from itertools import izip_longest  # added in Py 2.6
except ImportError:
    from itertools import zip_longest as izip_longest  # name change in Py 3.x

try:
    from itertools import accumulate  # added in Py 3.2
except ImportError:
    def accumulate(iterable):
        'Return running totals (simplified version).'
        total = next(iterable)
        yield total
        for value in iterable:
            total += value
            yield total

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # optional informational function attributes
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
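To illustrate the speed-up mentioned above, here is a sketch of the parser with padding-field support removed (make_parser_nopad is a made-up name, and all widths must now be positive):

def make_parser_nopad(fieldwidths):
    # running totals give each field's end; shift by one for the starts
    cuts = []
    total = 0
    for fw in fieldwidths:
        total += fw
        cuts.append(total)
    starts = [0] + cuts[:-1]
    return lambda line: tuple(line[i:j] for i, j in zip(starts, cuts))

parse = make_parser_nopad((2, 10, 24))  # the padding field is now a kept column
fields = parse('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n')
print(fields)  # ('AB', 'CDEFGHIJKL', 'MNOPQRSTUVWXYZ0123456789')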