I have a series of input files such as:
chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";
chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";
chr1 hg19_refFlat exon 52870219 52870551 0.000000 + . gene_id "PRPF38A"; transcript_id "PRPF38A";
chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
chr1 hg19_refFlat exon 11839859 11840067 0.000000 + . gene_id "C1orf167"; transcript_id "C1orf167";
chr1 hg19_refFlat exon 29037032 29037154 0.000000 + . gene_id "GMEB1"; transcript_id "GMEB1";
chr1 hg19_refFlat exon 103356007 103356060 0.000000 - . gene_id "COL11A1"; transcript_id "COL11A1";
In my code I am trying to capture two elements from each line: the first is the number right after the word exon (the start position), and the second is the gene name (the letter and number combination in double quotes, e.g. "KDM4A"). Here is my code:
with open(infile,'r') as r:
    start = set([line.strip().split()[3] for line in r])
    genes = set([line.split('"')[1] for line in r])
print len(start)
print len(genes)
For some reason start works fine but genes is not capturing anything. Here is the output:
48050
0
I figure this is something to do with the "" surrounding the gene name but if I enter this on the terminal it works fine:
>>> x = 'A b P "G" m'
>>> x
'A b P "G" m'
>>> x.split('"')[1]
'G'
>>>
Any solutions would be much appreciated, even if it's a completely different way of capturing the two items of data from each line. Thanks
Also note that a backslash \ at the end of a line lets a single line of Python code be split across multiple lines of the script. split is a method of a string, and it splits the string into a list of strings.
Python string split(separator, maxsplit): separator is the delimiter at which the string is split; if it is not provided, any whitespace is a separator. maxsplit is a number that limits how many splits are made, so the result has at most maxsplit + 1 items.
readlines() reads all the lines in a single go and returns them as a list, with each line as one string element. It is suitable for small files, since it reads the whole file content into memory and then splits it into separate lines.
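To make the separator and maxsplit behaviour concrete, here is a small sketch that applies split to one line copied from the sample input in the question. The variable names (fields, exon_start, gene, head) are only illustrative and do not come from the original post.

```python
# One line copied from the sample input in the question.
line = 'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";'

# No separator given: the string is split on runs of whitespace.
fields = line.split()
exon_start = fields[3]

# Explicit separator: the string is split at every double quote,
# so index 1 is the text between the first pair of quotes.
gene = line.split('"')[1]

# maxsplit=2 stops after two splits, so the rest of the line
# stays together in the last item.
head = line.split('"', 2)

print(exon_start)
print(gene)
print(len(head))
```

This prints 44160380, KDM4A and 3, which suggests the quotes themselves are not what breaks the original code.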
This happens because your file object is exhausted after you loop over it once, here: start = set([line.strip().split()[3] for line in r])
When you then try to loop again, here: genes = set([line.split('"')[1] for line in r])
you are iterating over the exhausted file object, so the second comprehension receives no lines and genes ends up empty.
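The exhaustion can be reproduced in isolation. The sketch below writes two of the sample lines to a temporary file (a stand-in for the real input, since infile is not available here) and iterates over the same file object twice. The names sample, path, first_pass and second_pass are made up for this illustration.

```python
import os
import tempfile

# Two lines from the question, written to a temporary file that stands in
# for the real input file.
sample = (
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n'
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.gtf', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path, 'r') as r:
    first_pass = [line for line in r]    # consumes every line in the file
    second_pass = [line for line in r]   # position is already at the end, so nothing is left

os.remove(path)

print(len(first_pass))
print(len(second_pass))
```

The first count is 2 and the second is 0, the same pattern as the 48050 and 0 in the question.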
Solution:
You could seek back to the start of the file before the second loop (this is one of several possible solutions)
Modification to your code:
with open(infile,'r') as r:
    start = set([line.strip().split()[3] for line in r])
    r.seek(0, 0)
    genes = set([line.split('"')[1] for line in r])
print len(start)
print len(genes)
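Since the question also asks for a completely different way, another option is to read the file only once and fill both sets in the same loop. This is a minimal sketch, not the only way to do it: the helper name collect_starts_and_genes is hypothetical, and it assumes every non-blank line has at least four whitespace-separated fields and a quoted gene name, as in the sample. It uses print() with a single argument so that it runs on both Python 2 and Python 3.

```python
def collect_starts_and_genes(lines):
    """Collect unique exon start positions and gene names in a single pass."""
    start = set()
    genes = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines so the indexing below cannot fail on them
        start.add(line.split()[3])      # fourth whitespace-separated field
        genes.add(line.split('"')[1])   # text between the first pair of double quotes
    return start, genes


# Three lines from the question, used here in place of the real file.
sample_lines = [
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n',
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n',
    'chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";\n',
]

start, genes = collect_starts_and_genes(sample_lines)
print(len(start))
print(len(genes))
```

With the real input, the open file object can be passed directly, for example with open(infile, 'r') as r: start, genes = collect_starts_and_genes(r). Because each line is visited once, no seek is needed.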