Splitting lines on " from an input file in Python

Tags:

python

split

I have a series of input files such as:

chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   gene_id "KDM4A"; transcript_id "KDM4A";
chr1    hg19_refFlat    exon    19563636    19563732    0.000000    -   .   gene_id "EMC1"; transcript_id "EMC1";
chr1    hg19_refFlat    exon    52870219    52870551    0.000000    +   .   gene_id "PRPF38A"; transcript_id "PRPF38A";
chr1    hg19_refFlat    exon    53373540    53373626    0.000000    -   .   gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
chr1    hg19_refFlat    exon    11839859    11840067    0.000000    +   .   gene_id "C1orf167"; transcript_id "C1orf167";
chr1    hg19_refFlat    exon    29037032    29037154    0.000000    +   .   gene_id "GMEB1"; transcript_id "GMEB1";
chr1    hg19_refFlat    exon    103356007   103356060   0.000000    -   .   gene_id "COL11A1"; transcript_id "COL11A1";

In my code I am trying to capture two elements from each line: the first is the number right after exon, and the second is the gene name (the letter-and-number combination surrounded by "", e.g. "KDM4A"). Here is my code:

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        genes = set([line.split('"')[1] for line in r])
        print len(start)
        print len(genes)

For some reason start works fine, but genes is not capturing anything. Here is the output:

 48050
 0

I figure this has something to do with the "" surrounding the gene name, but if I try the same split in the terminal it works fine:

>>> x = 'A b P "G" m'
>>> x
'A b P "G" m'
>>> x.split('"')[1]
'G'
>>>

Any solutions would be much appreciated, even if it's a completely different way of capturing the two items of data from each line. Thanks.

asked Sep 16 '15 12:09

user3062260

1 Answer

This happens because the file object is exhausted once you have looped over it, which you do here: start = set([line.strip().split()[3] for line in r]). When you loop over it again here: genes = set([line.split('"')[1] for line in r]), there are no lines left to read, so genes ends up empty.
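The exhaustion can be reproduced without any input file. This is only a minimal sketch: io.StringIO and the two made-up lines are assumptions standing in for the real opened file, which is consumed by iteration in the same way.

```python
import io

# io.StringIO stands in for the opened file (an assumption, so the sketch
# runs without a file on disk).
r = io.StringIO(u'a b c 1 "G1"\na b c 2 "G2"\n')

first_pass = [line for line in r]   # reads every line
second_pass = [line for line in r]  # the object is already at its end

print(len(first_pass))
print(len(second_pass))
```

The second comprehension finds nothing to iterate over, which matches the 0 reported for genes in the question.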

Solution:

One solution is to seek back to the start of the file before looping over it a second time.

Modification to your code:

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        r.seek(0, 0)  # rewind to the start of the file before the second pass
        genes = set([line.split('"')[1] for line in r])
        print len(start)
        print len(genes)
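Another option, since the question is open to a different approach, is to read the file only once and fill both sets in the same loop. The following is a sketch rather than a tested drop-in: the function name collect_starts_and_genes and the check that skips malformed lines are assumptions added for illustration, and the sample list reuses two lines from the question in place of the real file.

```python
def collect_starts_and_genes(lines):
    """Collect exon start positions and gene names in a single pass."""
    starts = set()
    genes = set()
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines (an assumption about the input).
        if len(fields) < 4 or '"' not in line:
            continue
        starts.add(fields[3])            # number after "exon"
        genes.add(line.split('"')[1])    # text inside the first pair of quotes
    return starts, genes


# Two lines from the question stand in for the input file.
sample = [
    'chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   gene_id "KDM4A"; transcript_id "KDM4A";\n',
    'chr1    hg19_refFlat    exon    19563636    19563732    0.000000    -   .   gene_id "EMC1"; transcript_id "EMC1";\n',
]
starts, genes = collect_starts_and_genes(sample)
print(len(starts))
print(len(genes))
```

With a real file, the opened file object r can be passed in directly instead of sample, because the function only needs something that yields lines. This avoids the second pass and the seek altogether.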

answered Oct 31 '22 16:10

The6thSense
