
append multiple files and remove duplicates using dictionaries

So I have a few files that look like:

snpID  Gene
rs1  ABC1
rs2  ABC1
rs3  ABC25
rs4  PT4
rs5  MTND24

Other files contain further snpID-Gene pairs. A given snpID may appear in more than one file, and the Gene associated with it may differ between files. For example:

snpID  Gene
rs100  URX1
rs95  KL4
rs1  ABC1
rs2  ABC1-MHT5
rs3  ABC25
rs4  PT4-FIL42

What I want to do is append the contents of all the files and remove duplicates that have the same snpID and Gene pair. If the Gene for a snpID differs between files, the genes have to go into the same row. For the above example, the result should look like:

snpID  Gene
rs1  ABC1
rs2  ABC1, ABC1-MHT5
rs3  ABC25
rs4  PT4, PT4-FIL42
rs5  MTND24
rs100  URX1
rs95  KL4

I thought I could achieve this through creating dictionaries.

import glob
file_list = glob.glob('annotations.*')
dict_snps_genes = {}
for filename in file_list:
    with open(filename) as fileA:
        for line in fileA:
            col0 = line.split()[0]
            col1 = line.split()[1]
            dict_snps_genes[col0] = col1 

unique_dict_snps = {}
for key,value in dict_snps_genes:
    if key not in unique_dict_snps.keys():
        unique_dict_snps_genes[key] = value

I tested this before moving any further and this gives me an error like:

ValueError: too many values to unpack

PS: each file has around 8000 snpID-Gene pairs and there are more than 5 files.

Any ideas on how to get past this?

jules asked Dec 04 '25 17:12



2 Answers

You are looping over keys, but trying to assign those to both a key and value variable:

for key,value in dict_snps_genes:

Change that to loop over .items() instead:

for key,value in dict_snps_genes.items():

Or, better still on Python 2.x, use .iteritems(), which avoids building an intermediate list of pairs:

for key,value in dict_snps_genes.iteritems():
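
To see why the original loop fails, here is a minimal sketch using a small made-up dictionary in place of the real data. It assumes Python 3, where .items() is the only spelling; the variable names error_message and pairs are just for illustration.

```python
# Iterating over a dict directly yields its keys only.
dict_snps_genes = {"rs1": "ABC1", "rs2": "ABC1-MHT5"}

error_message = None
try:
    for key, value in dict_snps_genes:
        pass
except ValueError as exc:
    # Each key is a 3-character string such as "rs1", which cannot be
    # unpacked into exactly two names.
    error_message = str(exc)

# Iterating over .items() yields (key, value) pairs instead.
pairs = [(key, value) for key, value in dict_snps_genes.items()]
```

Here the unpacking fails on the very first key, because Python tries to spread the characters of "rs1" over the two names key and value.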

Note that the way you read the files, you only ever store the last-read gene for any given snpID; you overwrite the previous one if you find another entry for that id.
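
The overwriting is easy to reproduce in isolation. This is only a sketch with two hand-written rows for rs2 taken from the question; last_gene is a hypothetical name.

```python
# Two rows for the same snpID, as they would arrive from two different files.
rows = [("rs2", "ABC1"), ("rs2", "ABC1-MHT5")]

last_gene = {}
for snp_id, gene in rows:
    # Plain assignment replaces whatever was stored for this key before.
    last_gene[snp_id] = gene

# Only the gene from the last row read survives; "ABC1" has been lost.
```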

Personally, I'd use collections.defaultdict() with a set default:

import glob
import collections

file_list = glob.glob('annotations.*')
snps_genes = collections.defaultdict(set)
for filename in file_list:
    with open(filename) as fileA:
        for line in fileA:
            snpid, gene = line.strip().split(None, 1)
            snps_genes[snpid].add(gene)

Now the values in snps_genes are sets of genes, each unique. Note that I split your line into 2 pieces on whitespace (.split(None, 1)) so that if there is any whitespace in the gene value, it'll be stored as such:

>>> 'id gene with whitespace'.split(None, 1)
['id', 'gene with whitespace']

By using snpid, gene as the left-hand side of the assignment, Python takes the result of the split and assigns each piece to a separate variable; a handy trick here to save a line of code.
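
One caveat with unpacking: it raises a ValueError when a line does not split into exactly two pieces, for example a blank line. The sketch below wraps the split in a hypothetical helper, parse_line, that returns None for such lines; whether skipping them is the right policy for your files is an assumption.

```python
def parse_line(line):
    """Return (snp_id, gene) for a data line, or None if it has fewer than two fields."""
    # Split on the first run of whitespace only, so the gene may contain spaces.
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        return None
    snp_id, gene = parts
    return snp_id, gene
```

A caller can then test the result before adding it to the dictionary, instead of letting one malformed line abort the whole run.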

To output this to a new file, simply loop over the resulting snps_genes structure. Here is one that sorts everything:

for id in sorted(snps_genes):
    print id, ', '.join(sorted(snps_genes[id]))
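
Putting the pieces together, the following is a rough end-to-end sketch rather than a drop-in script. It assumes Python 3, assumes every file starts with the "snpID  Gene" header row (which is skipped), and writes tab-separated output. The function names collect_genes and write_merged are made up for this example, and io.StringIO objects stand in for real files.

```python
import collections
import io


def collect_genes(line_sources, header_id="snpID"):
    """Map each snpID to the set of genes seen across all line sources."""
    snps_genes = collections.defaultdict(set)
    for lines in line_sources:
        for line in lines:
            parts = line.strip().split(None, 1)
            # Skip blank lines and the "snpID  Gene" header row.
            if len(parts) != 2 or parts[0] == header_id:
                continue
            snp_id, gene = parts
            snps_genes[snp_id].add(gene)
    return snps_genes


def write_merged(snps_genes, out):
    """Write a header, then one sorted row per snpID, to a file-like object."""
    out.write("snpID\tGene\n")
    for snp_id in sorted(snps_genes):
        genes = ", ".join(sorted(snps_genes[snp_id]))
        out.write("%s\t%s\n" % (snp_id, genes))


# In-memory stand-ins for two annotation files (rows taken from the question).
file_a = io.StringIO("snpID  Gene\nrs1  ABC1\nrs2  ABC1\nrs4  PT4\n")
file_b = io.StringIO("snpID  Gene\nrs1  ABC1\nrs2  ABC1-MHT5\nrs4  PT4-FIL42\n")

merged_output = io.StringIO()
write_merged(collect_genes([file_a, file_b]), merged_output)
result_text = merged_output.getvalue()
```

With real data, the line sources would be open file handles (or a single fileinput.input(...) over the globbed names) and out would be a file opened for writing. Note that sorted() orders the ids as strings, so rs100 sorts before rs2.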

Martijn Pieters answered Dec 06 '25 08:12



I would write it as the following:

from collections import defaultdict
from glob import glob
import fileinput

infiles = glob('annotations.*')
lines = fileinput.input(infiles)  # iterate over the lines of all files in turn
rows = (line.split() for line in lines)  # lazy generator of [snpID, Gene] lists

dd = defaultdict(list)
for row in rows:
    dd[row[0]].append(row[1])

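If the values are to be unique, use a set in place of the list (this replaces the loop above rather than following it, since rows is a generator and can only be consumed once):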

dd = defaultdict(set)
for row in rows:
    dd[row[0]].add(row[1])

And then go from there....
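
If the output should keep ids and genes in the order they were first seen, as in the expected result shown in the question, a set will not do because it has no defined order. One possible variation, sketched here for Python 3 with a hypothetical helper merge_in_order and a few hand-written rows from the question, is a list per id plus a membership check:

```python
from collections import OrderedDict


def merge_in_order(rows):
    """Group genes by snpID, keeping first-seen order and dropping repeats."""
    merged = OrderedDict()
    for snp_id, gene in rows:
        genes = merged.setdefault(snp_id, [])
        if gene not in genes:
            genes.append(gene)
    return merged


rows = [
    ("rs1", "ABC1"), ("rs2", "ABC1"), ("rs4", "PT4"),          # first file
    ("rs100", "URX1"), ("rs1", "ABC1"), ("rs2", "ABC1-MHT5"),  # second file
    ("rs4", "PT4-FIL42"),
]
ordered = merge_in_order(rows)
lines_out = ["%s  %s" % (snp_id, ", ".join(genes)) for snp_id, genes in ordered.items()]
```

The membership check scans a list, which is fine when each id has only a handful of genes, as in this data.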

Jon Clements answered Dec 06 '25 07:12



