This is an example of a complex tab-separated file I'm trying to parse:
ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process of converting glucose into pyruvate\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\tH00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism
I'm trying to obtain an output containing only some fields, like:
ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance H00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094\tNA
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism
The main problem is that not all the rows contain the same number of fields, so I need to delete, for example, the fields containing the string "DESCRIPTION", and add an empty field in the rows where the field "CLASS" is not present.
Moreover, for some fields the data are split across more than one field (for instance, in line 1 the field following DISEASE also contains disease data!) and I need to join them.
I've tried with:
    input = open('file', 'r')
    dict = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
    split_tab = []
    output = []
    for line in input:
        split_tab.append(line.split('\t'))
    for item in dict:
        for element in split_tab:
            if item in element:
                output.append(element)
            else:
                output.append('\tNA\t')
But it keeps everything, not only the elements specified in dict. Could you please help me?
Use the built-in csv module. Your job will be much easier.
Some sample code:
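    import csv

    # Keys are the upper-case labels that start each field in the input.
    fieldnames = ['ENTRY', 'NAME', 'CLASS', 'DISEASE', 'DBLINKS', 'REL_PATHWAY']
    # 'output.csv' is a placeholder name; write to a different file than the one being read.
    with open('myfile.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile, dialect='excel-tab')
        writer = csv.DictWriter(outfile, fieldnames, restval='', extrasaction='ignore', dialect='excel-tab')
        for row in reader:
            newrow = {}
            for field in row:
                # The first word of each field is its label.
                key = field.split(' ', 1)[0]
                newrow[key] = field
            writer.writerow(newrow)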
Pay particular attention to how the DictWriter is set up. It is much easier to use if you include the restval and extrasaction parameters. They allow you to pass a dictionary with more or fewer keys than the writer is expecting.
Simply have your fieldnames set appropriately, and set up the reader to use the correct dialect. This may include adding your own, but the csv link has instructions on how to do that.
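To make the effect of restval and extrasaction concrete, here is a minimal sketch that writes a single row to an in-memory buffer. The row contents and the two-column fieldnames list are made up for illustration, and restval is set to 'NA' here only to match the output the question asks for.

```python
import csv
import io

# Hypothetical row: one key the writer does not know, one expected key missing.
row = {"ENTRY": "ENTRY map00010", "DESCRIPTION": "DESCRIPTION not wanted"}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["ENTRY", "CLASS"],
    restval="NA",            # written for keys that are missing from the row
    extrasaction="ignore",   # keys not listed in fieldnames are dropped silently
    dialect="excel-tab",
)
writer.writerow(row)

result = buffer.getvalue()
print(repr(result))
```

The DESCRIPTION entry is dropped and the missing CLASS column is filled with the restval. Without extrasaction='ignore' the default is 'raise', and writerow would raise a ValueError for the unexpected key.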
EDIT
After Rob's comment posted below, I've revised this to take into account the fact that csv dialects are not as powerful as I thought.
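    requiredKeys = 'ENTRY NAME CLASS DISEASE DBLINKS REL_PATHWAY'.split(' ')
    with open('file', 'r') as infile:
        for line in infile:
            # Strip the newline so it does not end up in the last field.
            fields = line.rstrip('\n').split('\t')
            fieldMap = {}
            for field in fields:
                # The first word of each field is its label.
                key = field.split(' ', 1)[0]
                fieldMap[key] = field
            print('\t'.join([fieldMap.get(key, 'NA') for key in requiredKeys]))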
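The loop above keys every field by its first word, so a continuation field such as "H00072 Pyruvate dehydrogenase complex deficiency" ends up under the key H00072 and is lost. One possible way to handle the joining the question asks for is sketched below. It assumes that the set of labels that can start a field is known in advance (KNOWN_KEYS is a guess based on the sample data) and that any field not starting with one of those labels continues the previous labeled field. The function name select_fields is made up for this example.

```python
REQUIRED_KEYS = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
# Assumption: every label that can open a field. DESCRIPTION is known but not required.
KNOWN_KEYS = set(REQUIRED_KEYS) | {"DESCRIPTION"}


def select_fields(line, required=REQUIRED_KEYS, known=KNOWN_KEYS, missing="NA"):
    """Return the required fields of one record, joined by tabs."""
    field_map = {}
    current = None
    for field in line.rstrip("\n").split("\t"):
        key = field.split(" ", 1)[0]
        if key in known:
            # A new labeled field starts here.
            current = key
            field_map[current] = field
        elif current is not None:
            # No known label: treat it as a continuation of the previous field.
            field_map[current] += " " + field
    return "\t".join(field_map.get(key, missing) for key in required)


sample = (
    "ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process\t"
    "CLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\t"
    "H00072 Pyruvate dehydrogenase complex deficiency\t"
    "DBLINKS GO: 0006096 0006094\n"
)
result = select_fields(sample)
print(result)
```

Continuations are joined with a single space, which matches the DISEASE field in the first expected output row. The second expected row keeps a tab inside REL_PATHWAY, so if that is intended the join character would need to change. A continuation whose first word happens to equal a known label would be misread as a new field.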