 

splitting lines with " from an infile in python

Tags:

python

split

I have a series of input files such as:

chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   gene_id "KDM4A"; transcript_id "KDM4A";
chr1    hg19_refFlat    exon    19563636    19563732    0.000000    -   .   gene_id "EMC1"; transcript_id "EMC1";
chr1    hg19_refFlat    exon    52870219    52870551    0.000000    +   .   gene_id "PRPF38A"; transcript_id "PRPF38A";
chr1    hg19_refFlat    exon    53373540    53373626    0.000000    -   .   gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
chr1    hg19_refFlat    exon    11839859    11840067    0.000000    +   .   gene_id "C1orf167"; transcript_id "C1orf167";
chr1    hg19_refFlat    exon    29037032    29037154    0.000000    +   .   gene_id "GMEB1"; transcript_id "GMEB1";
chr1    hg19_refFlat    exon    103356007   103356060   0.000000    -   .   gene_id "COL11A1"; transcript_id "COL11A1";

In my code I am trying to capture two elements from each line: the first is the number after the word exon, and the second is the gene name (the combination of letters and numbers surrounded by double quotes, e.g. "KDM4A"). Here is my code:

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        genes = set([line.split('"')[1] for line in r])
        print len(start)
        print len(genes)

For some reason start works fine, but genes is not capturing anything. Here is the output:

 48050
 0

I figure this is something to do with the "" surrounding the gene name but if I enter this on the terminal it works fine:

>>> x = 'A b P "G" m'
>>> x
'A b P "G" m'
>>> x.split('"')[1]
'G'
>>> 

Any solutions would be much appreciated, even if it is a completely different way of capturing the two items of data from each line. Thanks.

user3062260 asked Sep 16 '15 12:09


1 Answer

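This happens because a file object is an iterator that can only be consumed once. The first comprehension, start = set([line.strip().split()[3] for line in r]), reads every line of the file. When the second one, genes = set([line.split('"')[1] for line in r]), loops over r again, the file object is already exhausted, so there is nothing left to read and genes ends up empty.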
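To see the exhaustion in isolation, here is a minimal sketch. It uses io.StringIO as a stand-in for your open file (an assumption made only so the example is self-contained), and the two sample lines are made up for illustration:

```python
import io

# io.StringIO stands in for an open text file; both can be iterated only once
# unless the read position is moved back.
sample = io.StringIO(u'a "X";\nb "Y";\n')

first_pass = [line.split('"')[1] for line in sample]   # consumes every line
second_pass = [line.split('"')[1] for line in sample]  # nothing left to read

sample.seek(0)  # move the read position back to the beginning
third_pass = [line.split('"')[1] for line in sample]   # lines are available again
```

The second comprehension does not raise an error; it simply produces an empty list, which is why len(genes) printed 0 in your output.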

Solution:

You could seek back to the start of the file before the second loop (this is one of several possible solutions).

Modification to your code:

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        r.seek(0, 0)
        genes = set([line.split('"')[1] for line in r])
        print len(start)
        print len(genes)
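Since you asked about a different way of capturing both items, another option is to read the file only once and fill both sets in the same loop. The following is a sketch rather than a definitive implementation: the function name collect_starts_and_genes is hypothetical, and it assumes every non-blank line has at least four whitespace-separated fields and at least one double-quoted value, as in your sample.

```python
def collect_starts_and_genes(lines):
    """Collect the 4th column and the first double-quoted value of each line.

    `lines` can be any iterable of strings, for example an open file.
    (Hypothetical helper, named here only for illustration.)
    """
    starts = set()
    genes = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        starts.add(line.split()[3])    # number after "exon"
        genes.add(line.split('"')[1])  # text between the first pair of quotes
    return starts, genes


# Two lines taken from the sample in the question, used here as test data.
example_lines = [
    'chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   gene_id "KDM4A"; transcript_id "KDM4A";\n',
    'chr1    hg19_refFlat    exon    19563636    19563732    0.000000    -   .   gene_id "EMC1"; transcript_id "EMC1";\n',
]
example_starts, example_genes = collect_starts_and_genes(example_lines)
```

With a real file you would pass the open file object instead of the list, for example starts, genes = collect_starts_and_genes(r) inside the with block. Because each line is handled once, no seek is needed.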
The6thSense answered Oct 31 '22 16:10
