Parsing a tab-separated file with missing fields

Question

This is an example of a complex tab-separated file I'm trying to parse:

ENTRY   map0010	NAME Glycolysis	DESCRIPTION Glycolysis is the process of converting glucose into pyruvate	CLASS   Metabolism	DISEASE   H00071  Hereditary fructose intolerance	H00072  Pyruvate dehydrogenase complex deficiency	DBLINKS     GO: 0006096 0006094
ENTRY   map00020	NAME  Citrate cycle (TCA cycle)	CLASS   Metabolism; Carbohydrate Metabolism	DISEASE   H00073  Pyruvate carboxylase deficiency	DBLINKS     GO: 0006099	REL_PATHWAY map00010  Glycolysis / Gluconeogenesis	map00053  Ascorbate and aldarate metabolism

I'm trying to obtain an output containing only some fields, like:

ENTRY   map0010	NAME Glycolysis	CLASS   Metabolism	DISEASE   H00071  Hereditary fructose intolerance H00072  Pyruvate dehydrogenase complex deficiency	DBLINKS     GO: 0006096 0006094	NA
ENTRY   map00020	NAME  Citrate cycle (TCA cycle)	CLASS   Metabolism; Carbohydrate Metabolism	DISEASE   H00073  Pyruvate carboxylase deficiency	DBLINKS     GO: 0006099	REL_PATHWAY map00010  Glycolysis / Gluconeogenesis	map00053  Ascorbate and aldarate metabolism

The main problem is that not all the rows contain the same number of fields, so I need to delete, for example, the fields containing the string "DESCRIPTION", and add an empty field in the rows where the field "CLASS" is not present.

Moreover, for some fields the data is split across more than one field (for instance, in line 1 the field following DISEASE also contains disease data), and I need to join them.

I've tried with:

input = open('file', 'r')

dict = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]

split_tab = []
output = []

for line in input:
    split_tab.append(line.split('	'))

for item in dict:
    for element in split_tab:
        if item in element:
            output.append(element)
        else:
            output.append('	NA	')

But it keeps everything, not only the elements specified in dict. Could you please help me?

Spencer Rathbun · Accepted Answer

Use the built-in csv module. Your job will be much easier.

Here is some sample code:

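import csv

# Only these columns are written; DESCRIPTION and any other keys are ignored.
fieldnames = ['ENTRY', 'NAME', 'CLASS', 'DISEASE', 'DBLINKS', 'REL_PATHWAY']

# Python 3: open csv files in text mode with newline=''.
# 'output.csv' is a placeholder; write to a different file than you read from.
with open('myfile.csv', newline='') as infile, \
     open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, dialect='excel-tab')
    writer = csv.DictWriter(outfile, fieldnames, restval='',
                            extrasaction='ignore', dialect='excel-tab')
    for row in reader:
        newrow = {}
        for field in row:
            # The first word of each field ("ENTRY", "NAME", ...) is its key.
            key = field.split(' ', 1)[0]
            newrow[key] = field
        writer.writerow(newrow)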

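Pay particular attention to how the DictWriter is set up. It is much easier to use if you pass the restval and extrasaction parameters: restval supplies the value written for any key missing from a row, and extrasaction='ignore' silently drops keys that are not in fieldnames. Together they let you pass a dictionary with more or fewer keys than the writer expects.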
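To see what those two parameters do in isolation, here is a minimal sketch that writes to an in-memory buffer instead of a file. The helper name write_rows, the sample rows, and the choice of 'NA' as the fill value are assumptions made for illustration only.

```python
import csv
import io

def write_rows(rows, fieldnames):
    """Write dicts as tab-separated text and return the result as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames, restval='NA',
                            extrasaction='ignore', dialect='excel-tab')
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()

# Hypothetical sample rows: the first has an extra key and lacks CLASS.
rows = [
    {'ENTRY': 'ENTRY map00010', 'DESCRIPTION': 'DESCRIPTION dropped'},
    {'ENTRY': 'ENTRY map00020', 'CLASS': 'CLASS Metabolism'},
]
result = write_rows(rows, ['ENTRY', 'CLASS'])
print(result)
```

The DESCRIPTION key is dropped because it is not in fieldnames, and the missing CLASS value in the first row is filled with the restval string.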

Simply set your fieldnames appropriately, and set up the reader to use the correct dialect. This may mean registering your own dialect; the csv module documentation explains how to do that.

EDIT

After Rob's comment posted below, I've revised this to take into account the fact that csv dialects are not as powerful as I thought.

rob mayoff · Answer

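requiredKeys = 'ENTRY NAME CLASS DISEASE DBLINKS REL_PATHWAY'.split(' ')

with open('file', 'r') as infile:
    for line in infile:
        # Strip the newline so it does not end up inside the last field.
        fields = line.rstrip('\n').split('\t')
        fieldMap = {}
        for field in fields:
            # The first word of each field ("ENTRY", "NAME", ...) is its key.
            key = field.split(' ', 1)[0]
            fieldMap[key] = field
        # Emit the required fields in order, with 'NA' for any that are missing.
        print('\t'.join(fieldMap.get(key, 'NA') for key in requiredKeys))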
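The question also asks to join data that is split across several fields, such as the second disease entry that follows DISEASE in the first sample row. Keying each field on its first word does not cover that case, because a continuation field like "H00072 ..." gets its own key and is then discarded. A possible sketch follows. It assumes that any field whose first word is not a known keyword continues the most recent keyword field, and that joining with a single space is acceptable. The names KNOWN_KEYS, REQUIRED_KEYS and parse_line are hypothetical, and the keyword set is taken only from the sample rows in the question.

```python
# Keywords seen in the sample data. DESCRIPTION is listed so that its
# continuation fields (if any) are not attached to the previous kept field.
KNOWN_KEYS = {'ENTRY', 'NAME', 'DESCRIPTION', 'CLASS',
              'DISEASE', 'DBLINKS', 'REL_PATHWAY'}
REQUIRED_KEYS = ['ENTRY', 'NAME', 'CLASS', 'DISEASE', 'DBLINKS', 'REL_PATHWAY']

def parse_line(line, known_keys=KNOWN_KEYS, required_keys=REQUIRED_KEYS):
    """Return one output row with continuation fields joined to their keyword."""
    field_map = {}
    current = None  # keyword of the most recent keyword field
    for field in line.rstrip('\n').split('\t'):
        key = field.split(' ', 1)[0]
        if key in known_keys:
            current = key
            field_map[current] = field
        elif current is not None:
            # Not a keyword: treat it as more data for the previous keyword.
            field_map[current] += ' ' + field
    return '\t'.join(field_map.get(k, 'NA') for k in required_keys)

sample = ('ENTRY   map0010\tNAME Glycolysis\tDESCRIPTION text\t'
          'CLASS   Metabolism\t'
          'DISEASE   H00071  Hereditary fructose intolerance\t'
          'H00072  Pyruvate dehydrogenase complex deficiency\t'
          'DBLINKS     GO: 0006096 0006094\n')
print(parse_line(sample))
```

If the real file contains keywords beyond those in the sample, they would need to be added to KNOWN_KEYS, otherwise their fields would be appended to whatever keyword came before them.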

Tags: python, parsing, csv

Asked by Sonny
