I have a bunch of text files, all in the same format (a snippet is below; the real files are longer):
Molecular weight = 43057.32 Residues = 391
Average Residue Weight = 110.121 Charge = -10.0
Isoelectric Point = 4.8926
Residue Number Mole% DayhoffStat
A = Ala 24 6.138 0.714
B = Asx 0 0.000 0.000
C = Cys 9 2.302 0.794
Property Residues Number Mole%
Tiny (A+C+G+S+T) 135 34.527
Small (A+B+C+D+G+N+P+S+T+V) 222 56.777
Aliphatic (A+I+L+V) 97 24.808
I have to extract all these variables and process them. My plan was to write code that goes through each line and pulls out the relevant values with a series of split, strip, etc. calls (a rough sketch of that approach is at the end of this question).
This is a pretty common task people do with Python, so I got to thinking there must be an easier method.
Is there any module or method out there that allows something like:
template = """
Molecular weight = {0} Residues = {1}
Average Residue Weight = {2} Charge = {3}
Isoelectric Point = {4}
Residue Number Mole% DayhoffStat
A = Ala {5} {6} {7}
B = Asx {8} {9} {10}
C = Cys {11} {12} {13}
Property Residues Number Mole%
Tiny (A+C+G+S+T) {14} {15}
Small (A+B+C+D+G+N+P+S+T+V) {16} {17}
Aliphatic (A+I+L+V) {18} {19}"""
and then, to extract the variables from another input file following the above format, you would do the following:
list_of_vars = Parse(template, infile)
Note that while the same variable appears on the same line in every file, it can be shifted a few characters to the right depending on how wide the value preceding it on that line is.
The files are the output from EMBOSS pepstats, in case anyone was wondering.
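For reference, the kind of line-by-line code I was planning to write looks roughly like this (just a sketch; the function name and dictionary keys are made up):

def parse_pepstats_header(path):
    # Pull the header values out of one pepstats file by splitting each line by hand.
    values = {}
    with open(path) as handle:
        for line in handle:
            if line.startswith('Molecular weight'):
                parts = line.split()
                values['molecular_weight'] = float(parts[3])
                values['residues'] = int(parts[6])
            elif line.startswith('Average Residue Weight'):
                parts = line.split()
                values['average_residue_weight'] = float(parts[4])
                values['charge'] = float(parts[7])
            elif line.startswith('Isoelectric Point'):
                values['isoelectric_point'] = float(line.split()[3])
    return values

Doing this for every variable in the file means a lot of repetitive code, which is what I'm hoping to avoid.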
Thanks everyone for the quick replies. The solution here was to use the findall function in the re module. Here is a simple example:
import re

class TemplateParser:
    def __init__(self, template):
        # Turn each {} placeholder into a capture group that matches a
        # (possibly negative) number, allowing for surrounding whitespace.
        self.m_template = template.replace('{}', r'[\s]*([\d\-\.]+)[\s]*')

    def ParseString(self, text):
        # findall returns a list of tuples; take the first (and only) match.
        return re.findall(self.m_template, text, re.DOTALL | re.MULTILINE)[0]
template = """
Molecular weight = {} Residues = {}
Average Residue Weight = {} Charge = {}
Isoelectric Point = {}
Residue Number Mole% DayhoffStat
A = Ala {} {} {}
B = Asx {} {} {}
C = Cys {} {} {}
Property Residues Number Mole%
Tiny \(A\+C\+G\+S\+T\) {} {}
Small \(A\+B\+C\+D\+G\+N\+P\+S\+T\+V\) {} {}
Aliphatic \(A\+I\+L\+V\) {} {}"""
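To run it on one of my files, I just read the whole file into a string and hand that to ParseString (the file name here is only an example):

# Read one pepstats output file and extract the values in template order.
with open('example.pepstats') as handle:
    contents = handle.read()

parser = TemplateParser(template)
values = parser.ParseString(contents)  # tuple of strings, e.g. ('43057.32', '391', ...)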
The ParseString function then returns a tuple of strings which I can process. As the files are always in the same format, I was able to process all my files. I only had two issues.
1) As you can see above, I've had to escape all regex special characters in my template, which isn't that big of a deal.
2) As I also mentioned above, this template is just a small snippet of the actual files I need to parse. When I tried this with my real data, re threw the following error:
"sorry, but this version only supports 100 named groups" AssertionError: sorry, but this version only supports 100 named groups
I worked around this by splitting my template string into 3 pieces, running the ParseString function 3 times with the 3 different templates, and adding the results together (see the sketch below).
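In case it helps anyone, the workaround looked roughly like this (template_part1, template_part2 and template_part3 are just the three chunks of my full template):

# Each chunk stays under the re module's group limit; the partial results are
# concatenated in the same order as the original template.
all_values = []
for part in (template_part1, template_part2, template_part3):
    all_values.extend(TemplateParser(part).ParseString(contents))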
Thanks again!
I can see this thread was answered long ago, but I came up with the same idea as the OP - using templates to parse text data - and ended up creating the Template Text Parser (TTP) module: https://ttp.readthedocs.io/en/latest/
Sample Python code to parse the OP's text:
template = """
<group>
Molecular weight = {{ Molecular_weight }} Residues = {{ Residues }}
Average Residue Weight = {{ Average_Residue_Weight }} Charge = {{ Charge }}
Isoelectric Point = {{ Isoelectric_Point }}
<group name="table1">
## Residue Number Mole% DayhoffStat
{{ Residue | PHRASE }} {{ Number | DIGIT }} {{ Mole }} {{ DayhoffStat }}
</group>
<group name="table2">
## Property Residues Number Mole%
{{ Property }} {{ Residues }} {{ Number | DIGIT }} {{ Mole }}
</group>
</group>
"""
sample_data = """
Molecular weight = 43057.32 Residues = 391
Average Residue Weight = 110.121 Charge = -10.0
Isoelectric Point = 4.8926
Residue Number Mole% DayhoffStat
A = Ala 24 6.138 0.714
B = Asx 0 0.000 0.000
C = Cys 9 2.302 0.794
Property Residues Number Mole%
Tiny (A+C+G+S+T) 135 34.527
Small (A+B+C+D+G+N+P+S+T+V) 222 56.777
Aliphatic (A+I+L+V) 97 24.808
"""
from ttp import ttp
parser = ttp(sample_data, template)
parser.parse()
result = parser.result(format="pprint")
print(result[0])
Will produce:
[ { 'Average_Residue_Weight': '110.121',
'Charge': '-10.0',
'Isoelectric_Point': '4.8926',
'Molecular_weight': '43057.32',
'Residues': '391',
'table1': [ { 'DayhoffStat': '0.714',
'Mole': '6.138',
'Number': '24',
'Residue': 'A = Ala'},
{ 'DayhoffStat': '0.000',
'Mole': '0.000',
'Number': '0',
'Residue': 'B = Asx'},
{ 'DayhoffStat': '0.794',
'Mole': '2.302',
'Number': '9',
'Residue': 'C = Cys'}],
'table2': [ { 'Mole': '34.527',
'Number': '135',
'Property': 'Tiny',
'Residues': '(A+C+G+S+T)'},
{ 'Mole': '56.777',
'Number': '222',
'Property': 'Small',
'Residues': '(A+B+C+D+G+N+P+S+T+V)'},
{ 'Mole': '24.808',
'Number': '97',
'Property': 'Aliphatic',
'Residues': '(A+I+L+V)'}]}]
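If you have a whole directory of pepstats files, one way to reuse the template (the names and glob pattern here are just illustrative) is to read each file and feed its contents to a fresh parser:

import glob
from ttp import ttp

for path in glob.glob("*.pepstats"):  # adjust the pattern to wherever your files live
    with open(path) as handle:
        data = handle.read()
    parser = ttp(data, template)
    parser.parse()
    print(path, parser.result(format="pprint")[0])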
Here's a rough start:
In [3]: data = """Molecular weight = 43057.32 Residues = 391
...: Average Residue Weight = 110.121 Charge = -10.0
...: Isoelectric Point = 4.8926
...:
...: Residue Number Mole% DayhoffStat
...: A = Ala 24 6.138 0.714
...: B = Asx 0 0.000 0.000
...: C = Cys 9 2.302 0.794
...:
...: Property Residues Number Mole%
...: Tiny (A+C+G+S+T) 135 34.527
...: Small (A+B+C+D+G+N+P+S+T+V) 222 56.777
...: Aliphatic (A+I+L+V) 97 24.808
...: """
In [5]: rx=r'Molecular weight += +([0-9\.]+).*Residues += +([0-9]+).*Average Residue Weight += +([0-9\.]+).*Charge += +([-+]*[0-9\.]+)'
In [7]: import re
In [12]: re.findall(rx, data, re.DOTALL|re.MULTILINE)
Out[12]: [('43057.32', '391', '110.121', '-10.0')]
As you can see, this extracts the first 4 fields from the file. If you truly have a fixed format file like this, you can extend the regex to get all the data in one call.
You'll need to polish the sub-expressions to get the correct floating-point formats etc. - as I said, this was a quick proof of concept. And the regex might become ridiculously long or hard to debug if the real files are significantly larger.
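For example, extending it to also grab the isoelectric point is just a matter of appending another clause (a quick sketch, not tested beyond this data):

rx = (r'Molecular weight += +([0-9\.]+).*Residues += +([0-9]+)'
      r'.*Average Residue Weight += +([0-9\.]+).*Charge += +([-+]*[0-9\.]+)'
      r'.*Isoelectric Point += +([0-9\.]+)')
re.findall(rx, data, re.DOTALL | re.MULTILINE)
# should give something like [('43057.32', '391', '110.121', '-10.0', '4.8926')]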
Just for comparison, here's what you get for the same data using the regex provided by ctwheels in their comment:
In [13]: rx2 = r'(?:\s*([a-zA-Z()+ ]+?)[ =]*)([-+]?\d+\.?\d*)'
In [14]: re.findall(rx2,data)
Out[14]:
[('Molecular weight', '43057.32'),
('Residues', '391'),
('Average Residue Weight', '110.121'),
('Charge', '-10.0'),
('Isoelectric Point', '4.8926'),
('Ala', '24'),
(' ', '6.138'),
(' ', '0.714'),
('Asx', '0'),
(' ', '0.000'),
(' ', '0.000'),
('Cys', '9'),
(' ', '2.302'),
(' ', '0.794'),
('Tiny (A+C+G+S+T)', '135'),
(' ', '34.527'),
('Small (A+B+C+D+G+N+P+S+T+V)', '222'),
(' ', '56.777'),
('Aliphatic (A+I+L+V)', '97'),
(' ', '24.808')]
In [15]: [m[1] for m in _]
Out[15]:
['43057.32',
'391',
'110.121',
'-10.0',
'4.8926',
'24',
'6.138',
'0.714',
'0',
'0.000',
'0.000',
'9',
'2.302',
'0.794',
'135',
'34.527',
'222',
'56.777',
'97',
'24.808']
Which might be good enough.
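If you want actual numbers rather than strings, one more step converts them (a sketch; it assumes every captured value is numeric):

pairs = re.findall(rx2, data)
numbers = [float(value) for _, value in pairs]
# e.g. [43057.32, 391.0, 110.121, -10.0, 4.8926, 24.0, 6.138, 0.714, ...]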