Parsing a tab-separated file with missing fields

This is an example of a complex tab-separated file I'm trying to parse:

ENTRY   map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process of converting glucose into pyruvate\tCLASS   Metabolism\tDISEASE   H00071  Hereditary fructose intolerance\tH00072  Pyruvate dehydrogenase complex deficiency\tDBLINKS     GO: 0006096 0006094
ENTRY   map00020\tNAME  Citrate cycle (TCA cycle)\tCLASS   Metabolism; Carbohydrate Metabolism\tDISEASE   H00073  Pyruvate carboxylase deficiency\tDBLINKS     GO: 0006099\tREL_PATHWAY map00010  Glycolysis / Gluconeogenesis\tmap00053  Ascorbate and aldarate metabolism

I'm trying to obtain an output containing only some fields, like:

ENTRY   map0010\tNAME Glycolysis\tCLASS   Metabolism\tDISEASE   H00071  Hereditary fructose intolerance H00072  Pyruvate dehydrogenase complex deficiency\tDBLINKS     GO: 0006096 0006094\tNA
ENTRY   map00020\tNAME  Citrate cycle (TCA cycle)\tCLASS   Metabolism; Carbohydrate Metabolism\tDISEASE   H00073  Pyruvate carboxylase deficiency\tDBLINKS     GO: 0006099\tREL_PATHWAY map00010  Glycolysis / Gluconeogenesis\tmap00053  Ascorbate and aldarate metabolism

The main problem is that not all rows contain the same number of fields, so I need to delete, for example, the fields containing the string "DESCRIPTION", and add an empty field in the rows where the field "CLASS" is not present.

Moreover, for some fields the data are split across more than one field (for instance, in line 1 the field following DISEASE also contains disease data), and I need to join them.

I've tried with:

input = open('file', 'r')

dict = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]

split_tab = []
output = []

for line in input:
    split_tab.append(line.split('\t'))

for item in dict:
    for element in split_tab:
        if item in element:
            output.append(element)
        else:
            output.append('\tNA\t')

But it keeps everything, not only the elements specified in dict. Could you please help me?

Sonny asked Dec 22 '22 06:12

2 Answers

Use the built-in csv module. Your job will be much easier.

Some sample code:

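import csv

# Columns to keep, in output order; each key is the leading keyword of a field.
fieldnames = ['ENTRY', 'NAME', 'CLASS', 'DISEASE', 'DBLINKS', 'REL_PATHWAY']

# Read from one file and write to another ('output.csv' is a placeholder name).
# newline='' is what the csv module expects for files in Python 3.
with open('myfile.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, dialect='excel-tab')
    writer = csv.DictWriter(outfile, fieldnames, restval='', extrasaction='ignore', dialect='excel-tab')
    for row in reader:
        newrow = {}
        for field in row:
            # The first word of each field (ENTRY, NAME, ...) becomes the dictionary key.
            key = field.split(' ', 1)[0]
            newrow[key] = field
        writer.writerow(newrow)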

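Pay particular attention to how the DictWriter is set up. It is much easier to use if you pass the restval and extrasaction parameters: restval supplies the value written for keys missing from a row, and extrasaction='ignore' drops keys that are not in fieldnames. Together they let you pass a dictionary with more or fewer keys than the writer is expecting.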
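
To make the effect of those two parameters concrete, here is a minimal sketch that writes a single row to an in-memory buffer instead of a file. The three-column fieldnames list and the sample row are made up for illustration, and restval is set to 'NA' here (rather than the empty string used above) only so the filled-in column is visible.

```python
import csv
import io

# Hypothetical column list, for illustration only.
fieldnames = ["ENTRY", "NAME", "CLASS"]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames,
    restval="NA",            # written for keys missing from a row
    extrasaction="ignore",   # keys not in fieldnames are silently dropped
    dialect="excel-tab",
)

# CLASS is missing from this row and DESCRIPTION is an extra key.
writer.writerow({"ENTRY": "map00010", "NAME": "Glycolysis", "DESCRIPTION": "unwanted"})

demo_output = buffer.getvalue()
print(repr(demo_output))
```

The extra DESCRIPTION key is dropped and the missing CLASS column is filled with 'NA'. Without extrasaction='ignore', the default behaviour is to raise a ValueError for the unexpected key.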

Simply set your fieldnames appropriately, and set up the reader to use the correct dialect. This may mean registering your own dialect; the csv module documentation explains how to do that.

EDIT

After Rob's comment posted below, I've revised this to take into account the fact that csv dialects are not as powerful as I thought.

Spencer Rathbun answered Dec 27 '22 06:12

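requiredKeys = 'ENTRY NAME CLASS DISEASE DBLINKS REL_PATHWAY'.split(' ')

with open('file', 'r') as infile:
    for line in infile:
        # Drop the trailing newline so it does not end up inside the last field.
        fields = line.rstrip('\n').split('\t')
        fieldMap = {}
        for field in fields:
            # Key each field by its leading keyword (ENTRY, NAME, ...).
            key = field.split(' ', 1)[0]
            fieldMap[key] = field
        # Emit the required fields in order, with NA for any that are missing.
        print('\t'.join(fieldMap.get(key, 'NA') for key in requiredKeys))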
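
The loop above keys every field by its first word, so a continuation field such as "H00072  Pyruvate dehydrogenase complex deficiency" ends up under its own key and is dropped rather than joined to DISEASE. One possible way to handle that, sketched under assumptions that are not in the question: the full set of keywords that can start a field is known in advance (KNOWN_KEYS below is a guess based on the two sample rows), any field that does not start with one of them continues the previous field, and continuations are joined with a single space. The function name select_fields is hypothetical.

```python
REQUIRED_KEYS = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
# Assumption: every keyword that can start a field, including unwanted ones.
KNOWN_KEYS = set(REQUIRED_KEYS) | {"DESCRIPTION"}


def select_fields(line, required=REQUIRED_KEYS, known=KNOWN_KEYS, missing="NA"):
    """Return the required fields of one row, tab-joined, with gaps filled."""
    field_map = {}
    current = None  # keyword of the field currently being accumulated
    for field in line.rstrip("\n").split("\t"):
        parts = field.split(None, 1)
        key = parts[0] if parts else ""
        if key in known:
            # A recognised keyword starts a new field.
            current = key
            field_map[current] = field
        elif current is not None:
            # Anything else is treated as a continuation of the previous field.
            field_map[current] += " " + field
    return "\t".join(field_map.get(k, missing) for k in required)


row = "ENTRY a\tDESCRIPTION x\tDISEASE d1\td2\n"
print(select_fields(row))
```

Because DESCRIPTION is in KNOWN_KEYS but not in REQUIRED_KEYS, its text (and any continuation of it) is collected and then left out of the output. A continuation that happens to begin with one of the known keywords would be misread as a new field, so this is only a starting point.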

rob mayoff answered Dec 27 '22 05:12