Parsing data from text file

Tags: python, file, parsing

I have a text file that has content like this:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

Then comes an empty line, followed by more similar blocks, all of them with the same data fields.

I am porting some C++ code to Python. One part of it reads the file line by line, detects the title line, and then matches each field name to extract the data. This doesn't look like smart code at all, and I think Python must have some library to parse data like this easily. After all, it almost looks like a CSV!

Any idea for this?

400

asked Jun 14 '13 09:06

Roman Rdgz

2 Answers

It is very far from CSV, actually.

You can use the file as an iterator; the following generator function yields complete sections:

def load_sections(filename):
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # skip ahead to the next '****' header line
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    return  # end of file: finish the generator

            entry = {}
            line = ''  # forget the header so the search restarts afterwards
            for line in infile:
                line = line.strip()
                if not line:
                    break  # an empty line closes the section

                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value

            yield entry

This treats the file as an iterator, meaning that any looping advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do the real work. The while loop skips lines until it finds a **** header line (the header itself is then discarded), and the for loop collects every line up to the next empty line into one section.
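To illustrate why next() and a for loop can be mixed on the same file object, here is a minimal sketch. It uses io.StringIO as a stand-in for an open file, and the sample text is made up for illustration; both constructs consume lines from the same shared position:

```python
import io

# io.StringIO stands in for a file opened with open(); the text is illustrative.
infile = io.StringIO("header\nfirst\nsecond\n\nafter blank\n")

skipped = next(infile)  # consumes "header\n"

collected = []
for line in infile:  # continues from "first", not from the top
    line = line.strip()
    if not line:
        break  # the blank line has been consumed; stop collecting
    collected.append(line)

remaining = next(infile).strip()  # picks up right after the blank line
print(skipped.strip(), collected, remaining)
```

Because the position is shared, no line is read twice and none is skipped unintentionally.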

Use the function in a loop:

for section in load_sections(filename):
    print(section)

Repeating your sample data in a text file results in:

>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
...
{'ID': '01', 'Data1': '0.1834869385E-002', 'Data2': '10.9598489301', 'Data3': '-0.1091356549E+001', 'Data4': '715'}
{'ID': '01', 'Data1': '0.1834869385E-002', 'Data2': '10.9598489301', 'Data3': '-0.1091356549E+001', 'Data4': '715'}
{'ID': '01', 'Data1': '0.1834869385E-002', 'Data2': '10.9598489301', 'Data3': '-0.1091356549E+001', 'Data4': '715'}

You can add some data converters to that if you want to; a mapping of key to callable would do:

converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}

Then, in the generator function, replace entry[key] = value with entry[key] = converters.get(key, lambda v: v)(value). Keys without a converter keep their string value.
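Putting the pieces together, a minimal sketch of the generator with the converters applied might look like the following. The function name load_typed_sections and the temporary sample file are assumptions made for illustration, not part of the original answer. The sketch uses next(infile, None) so that reaching the end of the file ends the generator cleanly on current Python 3 versions.

```python
import os
import tempfile

# Key-to-callable mapping; keys without an entry keep their string value.
converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}


def load_typed_sections(filename):  # hypothetical name, for illustration only
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # skip ahead to the next '****' header line
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    return  # end of file: finish the generator
            entry = {}
            line = ''
            for line in infile:
                line = line.strip()
                if not line:
                    break  # an empty line closes the section
                key, value = map(str.strip, line.split(':', 1))
                entry[key] = converters.get(key, lambda v: v)(value)
            yield entry


# Sample block copied from the question, written to a temporary file.
sample = """******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

"""

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample * 2)  # two identical blocks
    path = tmp.name

sections = list(load_typed_sections(path))
os.remove(path)
print(sections[0])
```

Both int() and float() accept the zero-padded forms used in the file, such as 01 and E-002, so no extra cleanup of the values should be needed.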

106

answered Oct 10 '22 00:10

Martijn Pieters

my_file:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

ID:                  02
Data1:               0.18348674325E-012
Data2:              10.9598489301
Data3:              0.0
Data4:                5748

ID:                  03
Data1:               20.1834869385E-002
Data2:              10.954576354
Data3:              10.13476858762435E+001
Data4:                7456

Python script:

import re

data = []
group = {}
with open('my_file', 'r') as f:
    # each match is a (key, value) pair such as ('Data1', '0.1834869385E-002');
    # the '-' sits last in the character class so it is taken literally
    for key, value in re.findall(r'(.*):\s*([\dE+.-]+)', f.read()):
        if key in group:
            # a repeated key means a new block has started
            data.append(group)
            group = {}
        group[key] = value
data.append(group)

print(data)

Printed output (reformatted for readability):

[
    {
        'ID': '01',
        'Data1': '0.1834869385E-002',
        'Data2': '10.9598489301',
        'Data3': '-0.1091356549E+001',
        'Data4': '715'
    },
    {
        'ID': '02',
        'Data1': '0.18348674325E-012',
        'Data2': '10.9598489301',
        'Data3': '0.0',
        'Data4': '5748'
    },
    {
        'ID': '03',
        'Data1': '20.1834869385E-002',
        'Data2': '10.954576354',
        'Data3': '10.13476858762435E+001',
        'Data4': '7456'
    }
]

answered Oct 10 '22 00:10

Peter Varo
