I have a text file that has content like this:
******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715
Then comes an empty line, followed by more similar blocks, all of them with the same data fields.
I am porting some C++ code to Python. One part reads the file line by line, detects the title line, and then matches each field name to extract the data. This doesn't look like smart code at all, and I think Python must have some library to parse data like this easily. After all, it almost looks like a CSV!
Any idea for this?
It is very far from CSV, actually.
You can use the file as an iterator; the following generator function yields complete sections:
def load_sections(filename):
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # skip ahead to the next '****' header line
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    return  # end of file: stop the generator
            entry = {}
            for line in infile:
                line = line.strip()
                if not line:
                    break  # a blank line ends the section
                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value
            yield entry
            line = ''  # force a search for the next header
This treats the file as an iterator, meaning that any looping advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do all the real work: first skip lines until a **** header line is found (the header itself is discarded), then loop over all non-empty lines to build a section.
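A minimal sketch of that shared-iterator behaviour, using io.StringIO as a stand-in for an open file. The sample text and the variable names here are made up for illustration only:

```python
import io

# io.StringIO stands in for an open text file; both are iterators over lines.
infile = io.StringIO("skip me\n**** header ****\nID: 01\nData4: 715\n\nleftover\n")

# next() consumes lines one at a time until the header is reached...
line = ''
while not line.startswith('****'):
    line = next(infile)

# ...and a for loop on the same object resumes right after the header.
fields = []
for line in infile:
    line = line.strip()
    if not line:
        break  # the blank line ends the section
    fields.append(line)

# Whatever neither loop consumed is still waiting in the iterator.
remaining = list(infile)

print(fields)
print(remaining)
```

Neither loop rewinds the file, which is why the generator can hand over from the header search to the field loop without losing or re-reading a line.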
Use the function in a loop:
for section in load_sections(filename):
    print(section)
Repeating your sample data in a text file results in the following (the order of the dictionary keys may differ between Python versions):
>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
...
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
You can add some data converters to that if you want to; a mapping of key to callable would do:
converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}
then in the generator function, instead of entry[key] = value, do entry[key] = converters.get(key, lambda v: v)(value).
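Put together, the generator with converters might look like the sketch below. To keep it self-contained it reads from any iterable of lines rather than a filename; the name parse_sections and that signature are assumptions made for this example, not part of the answer above. The sample text is the block from the question, repeated twice.

```python
def parse_sections(lines, converters):
    """Yield one dict per '****' section, converting values where possible."""
    lines = iter(lines)
    line = ''
    while True:
        # skip ahead to the next '****' header line
        while not line.startswith('****'):
            line = next(lines, None)
            if line is None:
                return  # input exhausted: stop the generator
        entry = {}
        for line in lines:
            line = line.strip()
            if not line:
                break  # a blank line ends the section
            key, value = map(str.strip, line.split(':', 1))
            # fall back to the raw string when no converter is registered
            entry[key] = converters.get(key, lambda v: v)(value)
        yield entry
        line = ''  # force a search for the next header


block = (
    "******** ENTRY 01 ********\n"
    "ID: 01\n"
    "Data1: 0.1834869385E-002\n"
    "Data2: 10.9598489301\n"
    "Data3: -0.1091356549E+001\n"
    "Data4: 715\n"
)
sample = block + "\n" + block

converters = {'ID': int, 'Data1': float, 'Data2': float,
              'Data3': float, 'Data4': int}

sections = list(parse_sections(sample.splitlines(), converters))
print(sections[0])
```

An open file object can be passed as lines directly, since it is also an iterable of lines.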
my_file:
******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715
ID: 02
Data1: 0.18348674325E-012
Data2: 10.9598489301
Data3: 0.0
Data4: 5748
ID: 03
Data1: 20.1834869385E-002
Data2: 10.954576354
Data3: 10.13476858762435E+001
Data4: 7456
Python script:
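import re

with open('my_file', 'r') as f:
    data = list()
    group = dict()
    # '-' is placed last in the character class so it is a literal, not a range
    for key, value in re.findall(r'(.*):\s*([\dE+.-]+)', f.read()):
        if key in group:
            # a repeated key marks the start of the next block
            data.append(group)
            group = dict()
        group[key] = value
    data.append(group)

print(data)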
Printed output:
[
    {
        'Data4': '715',
        'Data1': '0.1834869385E-002',
        'ID': '01',
        'Data3': '-0.1091356549E+001',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '5748',
        'Data1': '0.18348674325E-012',
        'ID': '02',
        'Data3': '0.0',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '7456',
        'Data1': '20.1834869385E-002',
        'ID': '03',
        'Data3': '10.13476858762435E+001',
        'Data2': '10.954576354'
    }
]