So I have a few files that look like:
snpID Gene
rs1 ABC1
rs2 ABC1
rs3 ABC25
rs4 PT4
rs5 MTND24
Different files will contain other snpID and Gene pairs, but there may be duplicate snpIDs across files, and the "Gene" associated with a given snpID could differ. For example:
snpID Gene
rs100 URX1
rs95 KL4
rs1 ABC1
rs2 ABC1-MHT5
rs3 ABC25
rs4 PT4-FIL42
What I want to do is append the contents of all the files and remove duplicates that have the same snpID and Gene pair. If the Gene associated with a snpID differs between files, both values should go into the same row. For the above example, the result should look like:
snpID Gene
rs1 ABC1
rs2 ABC1, ABC1-MHT5
rs3 ABC25
rs4 PT4, PT4-FIL42
rs5 MTND24
rs100 URX1
rs95 KL4
I thought I could achieve this by creating dictionaries.
import glob

file_list = glob.glob('annotations.*')
dict_snps_genes = {}
for filename in file_list:
    with open(filename) as fileA:
        for line in fileA:
            col0 = line.split()[0]
            col1 = line.split()[1]
            dict_snps_genes[col0] = col1

unique_dict_snps = {}
for key,value in dict_snps_genes:
    if key not in unique_dict_snps.keys():
        unique_dict_snps_genes[key] = value
I tested this before moving any further, and it gives me an error:
ValueError: too many values to unpack
PS: each file has around 8000 snpID-Gene pairs and there are more than 5 files.
Any ideas on how to get past this?
You are looping over keys, but trying to assign those to both a key and value variable:
for key,value in dict_snps_genes:
change that to loop over .items():
for key,value in dict_snps_genes.items():
or better still, if on Python 2.x, use `.iteritems()`:
for key,value in dict_snps_genes.iteritems():
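To see why the original loop blows up: iterating over a dict yields just its keys (strings like 'rs1'), and Python then tries to unpack each string into two names. A minimal sketch with a throwaway dict:

d = {'rs1': 'ABC1', 'rs2': 'ABC1-MHT5'}
for key in d:
    # plain iteration yields only the keys: 'rs1', then 'rs2'
    print key
for key, value in d.items():
    # .items() yields (key, value) tuples, so the unpacking works
    print key, value

Unpacking the three-character string 'rs1' into two variables is exactly what raises "ValueError: too many values to unpack".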
Note that the way you read the files, you only ever store the last-read gene for any given snpID; you overwrite the previous one if you find another entry for that id.
Personally, I'd use collections.defaultdict() with a set default:
import glob
import collections

file_list = glob.glob('annotations.*')
snps_genes = collections.defaultdict(set)
for filename in file_list:
    with open(filename) as fileA:
        for line in fileA:
            snpid, gene = line.strip().split(None, 1)
            snps_genes[snpid].add(gene)
Now the values in snps_genes are sets of genes, each unique. Note that I split your line into 2 pieces on whitespace (.split(None, 1)) so that if there is any whitespace in the gene value, it'll be stored as such:
>>> 'id gene with whitespace'.split(None, 1)
['id', 'gene with whitespace']
By using snpid, gene as the left-hand side of the assignment, Python takes the result of the split and assigns each piece to a separate variable; a handy trick here to save a line of code.
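Run over the two sample files above (and ignoring the snpID Gene header line, which the code as written would pick up as an ordinary row), snps_genes would end up looking roughly like this; set ordering is arbitrary:

{'rs1': set(['ABC1']),
 'rs2': set(['ABC1', 'ABC1-MHT5']),
 'rs3': set(['ABC25']),
 'rs4': set(['PT4', 'PT4-FIL42']),
 'rs5': set(['MTND24']),
 'rs100': set(['URX1']),
 'rs95': set(['KL4'])}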
To output this to a new file, simply loop over the resulting snps_genes structure. Here is one that sorts everything:
for id in sorted(snps_genes):
    print id, ', '.join(sorted(snps_genes[id]))
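The loop above prints to stdout; if you want the merged table in a new file instead, the same idea works with a file handle (the output filename and the header line here are just placeholders):

with open('merged_annotations.txt', 'w') as out:
    out.write('snpID Gene\n')
    for snpid in sorted(snps_genes):
        out.write('%s %s\n' % (snpid, ', '.join(sorted(snps_genes[snpid]))))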
I would write it as the following:
from glob import glob
import fileinput

infiles = glob('annotations.*')
lines = fileinput.input(infiles)
rows = (line.split() for line in lines)

from collections import defaultdict
dd = defaultdict(list)

for row in rows:
    dd[row[0]].append(row[1])

If the values are to be unique, then:

dd = defaultdict(set)

for row in rows:
    dd[row[0]].add(row[1])
And then go from there....
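"Go from there" could be as simple as dumping the merged pairs in the comma-joined format the question asks for; a sketch, with combined.txt as an assumed output name:

with open('combined.txt', 'w') as out:
    out.write('snpID Gene\n')
    for snpid in sorted(dd):
        out.write('%s %s\n' % (snpid, ', '.join(sorted(dd[snpid]))))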