
Parse info from a file based off of a template file in python

I have a bunch of text files, all in the same format (this is a snippet below, real file is longer):

Molecular weight = 43057.32         Residues = 391   
Average Residue Weight  = 110.121   Charge   = -10.0
Isoelectric Point = 4.8926

Residue     Number      Mole%       DayhoffStat
A = Ala     24      6.138       0.714   
B = Asx     0       0.000       0.000   
C = Cys     9       2.302       0.794   

Property    Residues        Number      Mole%
Tiny        (A+C+G+S+T)     135     34.527
Small       (A+B+C+D+G+N+P+S+T+V)   222     56.777
Aliphatic   (A+I+L+V)       97      24.808

I have to extract all these variables and process them. I was going to write some code that goes through each line one at a time and extracts the relevant info through a series of split, strip etc. functions.

This is a pretty common task in Python, so I got to thinking there must be an easier way to do it.

Is there any module or method out there to allow something like:

template = """
Molecular weight = {0}          Residues = {1}   
Average Residue Weight  = {2}   Charge   = {3}
Isoelectric Point = {4}

Residue     Number      Mole%       DayhoffStat
A = Ala     {5}     {6}         {7}
B = Asx     {8}     {9}         {10}     
C = Cys     {11}    {12}        {13}    

Property    Residues        Number      Mole%
Tiny        (A+C+G+S+T)     {14}        {15}
Small       (A+B+C+D+G+N+P+S+T+V)   {16}        {17}
Aliphatic   (A+I+L+V)       {18}        {19}"""

and then, to extract the variables from another input file that follows the above format, you would do the following:

list_of_vars = Parse(template, infile)

Note that while the same variable will appear in every file on the same line, they can be shifted a few characters to the right depending on how big the value preceding it on the same line is.

The files are the output from emboss pepstats in case anyone was wondering.

SOLUTION

Thanks everyone for the quick replies. The solution here was to use the findall function in the re module. Here is a simple example:

import re

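class TemplateParser:
    def __init__(self, template):
        # Turn each {} placeholder into a capture group for a number,
        # allowing optional whitespace on either side.
        self.m_template = template.replace('{}', r'\s*([\d\-.]+)\s*')

    def ParseString(self, text):
        # `text` is the contents of the file, not its name.
        return re.findall(self.m_template, text, re.DOTALL | re.MULTILINE)[0]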

template = """
Molecular weight = {}          Residues = {}   
Average Residue Weight  = {}   Charge   = {}
Isoelectric Point = {}

Residue     Number      Mole%       DayhoffStat
A = Ala     {}    {}        {}
B = Asx     {}    {}        {}     
C = Cys     {}    {}        {}    

Property    Residues        Number      Mole%
Tiny        \(A\+C\+G\+S\+T\)     {}        {}
Small       \(A\+B\+C\+D\+G\+N\+P\+S\+T\+V\)   {}        {}
Aliphatic   \(A\+I\+L\+V\)       {}        {}"""

The ParseString function then successfully returns the captured values as a tuple of strings, which I can then process. As the files are always in the same format, I was able to process all my files successfully. I only had two issues.

1) As you can see above, I've had to escape all the regex special characters in my template, which isn't that big of a deal.

2) As I also mentioned above, this template is just a small snippet of the actual files I need to parse. When I tried this with my real data, re threw the following error:

AssertionError: sorry, but this version only supports 100 named groups

I worked around this by splitting my template string into 3 pieces, running the ParseString function once per piece, and concatenating the results.
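Both issues can probably be sidestepped by building the regex from the template one line at a time. The following is only a minimal sketch under some assumptions: `line_to_regex`, `parse`, and `NUMBER` are hypothetical names, numbers are assumed to be plain decimals with an optional sign, and the template and data are assumed to have the same non-blank lines in the same order. `re.escape` takes care of characters such as `(` and `+`, runs of whitespace are matched loosely so shifted columns still fit, and because each line is compiled separately no single pattern comes near 100 groups. (As far as I know, that limit was lifted in Python 3.5, so it may only matter on older interpreters.)

```python
import re

# Pattern for one numeric field (assumption: plain decimals, optional sign).
NUMBER = r"([-+]?\d+(?:\.\d+)?)"


def line_to_regex(template_line, placeholder="{}"):
    """Turn one template line into a regex (hypothetical helper)."""
    literals = template_line.strip().split(placeholder)
    pattern = ""
    for i, literal in enumerate(literals):
        if i > 0:
            pattern += NUMBER  # a placeholder sat just before this literal
        # Escape the literal words; any run of whitespace between them is allowed.
        escaped = r"\s+".join(re.escape(word) for word in literal.split())
        if escaped:
            pattern += r"\s*" + escaped + r"\s*"
        elif 0 < i < len(literals) - 1:
            pattern += r"\s+"  # two placeholders separated only by spaces
    return pattern


def parse(template, text):
    """Match template and data line by line; return the captured strings."""
    template_lines = [line for line in template.splitlines() if line.strip()]
    data_lines = [line for line in text.splitlines() if line.strip()]
    if len(template_lines) != len(data_lines):
        raise ValueError("template and data have different line counts")
    values = []
    for t_line, d_line in zip(template_lines, data_lines):
        match = re.fullmatch(line_to_regex(t_line), d_line.strip())
        if match is None:
            raise ValueError(f"line does not fit template: {d_line!r}")
        values.extend(match.groups())
    return values


template = """
Molecular weight = {}   Residues = {}
Isoelectric Point = {}

Residue     Number      Mole%       DayhoffStat
A = Ala     {}    {}    {}

Property    Residues        Number      Mole%
Tiny        (A+C+G+S+T)     {}    {}
"""

data = """
Molecular weight = 43057.32         Residues = 391
Isoelectric Point = 4.8926

Residue     Number      Mole%       DayhoffStat
A = Ala     24      6.138       0.714

Property    Residues        Number      Mole%
Tiny        (A+C+G+S+T)     135     34.527
"""

values = parse(template, data)
print(values)
# ['43057.32', '391', '4.8926', '24', '6.138', '0.714', '135', '34.527']
```

Matching line by line also means a mismatch is reported for the exact line that failed, which tends to be easier to debug than one very long pattern.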

Thanks again!

lotuspaperboy asked Sep 26 '17 15:09


2 Answers

I can see this thread was answered long ago, but I came up with the same idea as the OP (use templates to parse text data) and ended up creating the Template Text Parser module: https://ttp.readthedocs.io/en/latest/

Sample Python code to parse the OP's text:

template = """
<group>
Molecular weight = {{ Molecular_weight }}         Residues = {{ Residues }}   
Average Residue Weight  = {{ Average_Residue_Weight }}   Charge   = {{ Charge }}
Isoelectric Point = {{ Isoelectric_Point }}

<group name="table1">
## Residue              Number                Mole%       DayhoffStat
{{ Residue | PHRASE }}  {{ Number | DIGIT }}  {{ Mole }}  {{ DayhoffStat }}
</group>

<group name="table2">
## Property     Residues        Number                Mole%
{{ Property }}  {{ Residues }}  {{ Number | DIGIT }}  {{ Mole }}
</group>
</group>
"""

sample_data = """
Molecular weight = 43057.32         Residues = 391   
Average Residue Weight  = 110.121   Charge   = -10.0
Isoelectric Point = 4.8926

Residue     Number      Mole%       DayhoffStat
A = Ala     24      6.138       0.714   
B = Asx     0       0.000       0.000   
C = Cys     9       2.302       0.794   

Property    Residues        Number      Mole%
Tiny        (A+C+G+S+T)     135     34.527
Small       (A+B+C+D+G+N+P+S+T+V)   222     56.777
Aliphatic   (A+I+L+V)       97      24.808
"""

from ttp import ttp
parser = ttp(sample_data, template)
parser.parse()  # run the parser before asking for results
result = parser.result(format="pprint")
print(result[0])

Will produce:

[   {   'Average_Residue_Weight': '110.121',
        'Charge': '-10.0',
        'Isoelectric_Point': '4.8926',
        'Molecular_weight': '43057.32',
        'Residues': '391',
        'table1': [   {   'DayhoffStat': '0.714',
                          'Mole': '6.138',
                          'Number': '24',
                          'Residue': 'A = Ala'},
                      {   'DayhoffStat': '0.000',
                          'Mole': '0.000',
                          'Number': '0',
                          'Residue': 'B = Asx'},
                      {   'DayhoffStat': '0.794',
                          'Mole': '2.302',
                          'Number': '9',
                          'Residue': 'C = Cys'}],
        'table2': [   {   'Mole': '34.527',
                          'Number': '135',
                          'Property': 'Tiny',
                          'Residues': '(A+C+G+S+T)'},
                      {   'Mole': '56.777',
                          'Number': '222',
                          'Property': 'Small',
                          'Residues': '(A+B+C+D+G+N+P+S+T+V)'},
                      {   'Mole': '24.808',
                          'Number': '97',
                          'Property': 'Aliphatic',
                          'Residues': '(A+I+L+V)'}]}]
apraksim answered Oct 12 '22 09:10


Here's a rough start

In [3]: data = """Molecular weight = 43057.32         Residues = 391   
   ...: Average Residue Weight  = 110.121   Charge   = -10.0
   ...: Isoelectric Point = 4.8926
   ...: 
   ...: Residue     Number      Mole%       DayhoffStat
   ...: A = Ala     24      6.138       0.714   
   ...: B = Asx     0       0.000       0.000   
   ...: C = Cys     9       2.302       0.794   
   ...: 
   ...: Property    Residues        Number      Mole%
   ...: Tiny        (A+C+G+S+T)     135     34.527
   ...: Small       (A+B+C+D+G+N+P+S+T+V)   222     56.777
   ...: Aliphatic   (A+I+L+V)       97      24.808
   ...: """
In [5]: rx=r'Molecular weight += +([0-9\.]+).*Residues += +([0-9]+).*Average Residue Weight += +([0-9\.]+).*Charge += +([-+]*[0-9\.]+)'
In [7]: import re
In [12]: re.findall(rx, data, re.DOTALL|re.MULTILINE)
Out[12]: [('43057.32', '391', '110.121', '-10.0')]

As you can see, this extracts the first 4 fields from the file. If you truly have a fixed format file like this, you can extend the regex to get all the data in one call.

You'll need to polish the sub-expressions to get the correct floating-point formats etc. As I said, this was a quick proof of concept. And the RE might become ridiculously long or hard to debug if the real files are significantly larger.

Just for comparison, here's what you get for the same data using the regex provided by ctwheels in their comment:
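As one possible way to polish those sub-expressions, here is a hedged sketch that factors the number pattern into a single reusable piece and uses `re.VERBOSE` with named groups to keep the long expression readable. The names `FLOAT`, `HEADER_RX`, and the group names are my own choices, and the accepted float syntax (optional sign, optional fraction, optional exponent) is an assumption about what pepstats can emit.

```python
import re

# One reusable float pattern (assumption: optional sign, fraction, exponent).
FLOAT = r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?"

# In VERBOSE mode unescaped whitespace in the pattern is ignored,
# so literal spaces are written as "\ " and gaps as \s* or \s+.
HEADER_RX = re.compile(
    rf"""
    Molecular\ weight\s*=\s*(?P<molecular_weight>{FLOAT})\s+
    Residues\s*=\s*(?P<residues>\d+)\s+
    Average\ Residue\ Weight\s*=\s*(?P<average_residue_weight>{FLOAT})\s+
    Charge\s*=\s*(?P<charge>{FLOAT})\s+
    Isoelectric\ Point\s*=\s*(?P<isoelectric_point>{FLOAT})
    """,
    re.VERBOSE,
)

data = """Molecular weight = 43057.32         Residues = 391
Average Residue Weight  = 110.121   Charge   = -10.0
Isoelectric Point = 4.8926
"""

match = HEADER_RX.search(data)
if match is None:
    raise ValueError("header block not found")

# Convert every captured field to float, then fix up the integer count.
fields = {name: float(value) for name, value in match.groupdict().items()}
fields["residues"] = int(match.group("residues"))
print(fields)
```

Named groups make the result self-describing, so a later change in field order in the regex does not silently shift which value is which.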

In [13]: rx2=r'(?:\s*([a-zA-Z()+ ]+?)[ =]*)([-+]?\d+\.?\d*)'

In [14]: re.findall(rx2,data)
Out[14]: 
[('Molecular weight', '43057.32'),
 ('Residues', '391'),
 ('Average Residue Weight', '110.121'),
 ('Charge', '-10.0'),
 ('Isoelectric Point', '4.8926'),
 ('Ala', '24'),
 (' ', '6.138'),
 (' ', '0.714'),
 ('Asx', '0'),
 (' ', '0.000'),
 (' ', '0.000'),
 ('Cys', '9'),
 (' ', '2.302'),
 (' ', '0.794'),
 ('Tiny        (A+C+G+S+T)', '135'),
 (' ', '34.527'),
 ('Small       (A+B+C+D+G+N+P+S+T+V)', '222'),
 (' ', '56.777'),
 ('Aliphatic   (A+I+L+V)', '97'),
 (' ', '24.808')]
In [15]: [m[1] for m in _]
Out[15]: 
['43057.32',
 '391',
 '110.121',
 '-10.0',
 '4.8926',
 '24',
 '6.138',
 '0.714',
 '0',
 '0.000',
 '0.000',
 '9',
 '2.302',
 '0.794',
 '135',
 '34.527',
 '222',
 '56.777',
 '97',
 '24.808']

Which might be good enough.
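Either way the values come back as strings, so some conversion is likely needed before processing them. A small sketch, assuming every captured string is a valid integer or decimal literal (`to_number` is a hypothetical helper, and `raw` is just the first few values from the output above):

```python
def to_number(text):
    """Return an int for integer-looking strings, a float otherwise."""
    try:
        return int(text)
    except ValueError:
        return float(text)


raw = ['43057.32', '391', '110.121', '-10.0', '4.8926', '24', '6.138', '0.714']
numbers = [to_number(s) for s in raw]
print(numbers)
# [43057.32, 391, 110.121, -10.0, 4.8926, 24, 6.138, 0.714]
```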

kdopen answered Oct 12 '22 09:10