
Find matches in two files using Python

Tags:

python

match

I am analyzing sequencing data and have a few candidate genes whose functions I need to find.

After editing the available human database, I want to compare my candidate genes against it and output the function of each candidate gene.

I have only basic Python skills, so I thought a script might speed up finding the functions of my candidate genes.

file1, which contains the candidate genes, looks like this:

Gene
AQP7
RLIM
SMCO3
COASY
HSPA6

and the database, file2.csv, looks like this:

Gene   function 
PDCD6  Programmed cell death protein 6 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a

Desired output:

Gene (from file1), function (matching from file2)

I tried this code:

file1 = 'file1.csv'
file2 = 'file2.csv'
output = 'file3.txt'

with open(file1) as inf:
    match = set(line.strip() for line in inf)

with open(file2) as inf, open(output, 'w') as outf:
    for line in inf:
        if line.split(' ',1)[0] in match:
            outf.write(line)

I only get a blank page.

I also tried using the set intersection method:

with open('file1.csv', 'r') as ref:
    with open('file2.csv','r') as com:
       with open('common_genes_function','w') as output:
           same = set(ref).intersection(com)
                print same

That did not work either.

Please help, otherwise I will need to do this manually.

Jan Shamsani asked Jun 27 '26 05:06


2 Answers

I would recommend using the pandas merge function. However, it requires a clear separator between the 'Gene' and 'function' columns. In my example, I assume it is a tab:

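import pandas as pd

# Paths to the input and output files (names taken from the question)
filepath1 = 'file1.csv'
filepath2 = 'file2.csv'
filepath3 = 'file3.txt'

# Open files as pandas DataFrames
file1 = pd.read_csv(filepath1, sep='\t')
file2 = pd.read_csv(filepath2, sep='\t')

# Merge on the 'Gene' column using 'inner', so the result is
# the intersection of both datasets
file3 = pd.merge(file1, file2, how='inner', on=['Gene'], suffixes=('1', '2'))
# index=False keeps the row numbers out of the output file
file3.to_csv(filepath3, sep=',', index=False)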
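A possible variation, sketched under some assumptions: the sample database in the question repeats the CDC2 row several times, and an inner merge silently drops candidates that have no entry. The snippet below uses made-up in-memory data (the `candidates_text` and `database_text` strings are illustrative stand-ins for the two files, assumed to be tab-separated), removes duplicate genes first, and uses a left merge so unmatched candidates stay visible with an empty function.

```python
import io

import pandas as pd

# Hypothetical stand-ins for file1.csv and file2.csv (assumed tab-separated).
candidates_text = "Gene\nAQP7\nCDC2\nPDCD6\n"
database_text = (
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
)

candidates = pd.read_csv(io.StringIO(candidates_text), sep="\t")
database = pd.read_csv(io.StringIO(database_text), sep="\t")

# Keep one row per gene so repeated database entries do not multiply matches.
database = database.drop_duplicates(subset="Gene")

# how="left" keeps every candidate; genes missing from the database get NaN.
annotated = candidates.merge(database, on="Gene", how="left")

print(annotated.to_csv(index=False))
```

Whether a left or an inner merge is the better fit depends on whether you want to see the candidates that had no match.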
RaJa answered Jun 28 '26 19:06


Using basic Python, you can try the following:

import re

gene_function = {}
with open('file2.csv', 'r') as infile:
    # Skip the header row and strip surrounding whitespace
    lines = [line.strip() for line in infile.readlines()[1:]]
    for line in lines:
        # Gene symbol, then whitespace, then the function text
        match = re.search(r"(\w+)\s+(.*)", line)
        if match is None:
            continue  # skip blank or malformed lines
        gene = match.group(1)
        function = match.group(2)
        # Keep the first function seen for a repeated gene
        if gene not in gene_function:
            gene_function[gene] = function

with open('file1.csv', 'r') as infile:
    genes = [i.strip() for i in infile.readlines()[1:]]
    for gene in genes:
        if gene in gene_function:
            print("{}, {}".format(gene, gene_function[gene]))
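If you want the result in a file rather than printed, here is a minimal sketch of the same dictionary lookup. It assumes the database columns are separated by whitespace, as in the sample shown in the question. The function name `annotate_genes` and the sample lists are made up for illustration. Because some function descriptions contain commas, the `csv` module is used so those fields get quoted.

```python
import csv
import io


def annotate_genes(candidate_lines, database_lines):
    """Return (gene, function) pairs for candidates found in the database."""
    gene_function = {}
    for line in database_lines[1:]:  # skip the header row
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            # setdefault keeps the first function seen for a repeated gene
            gene_function.setdefault(parts[0], parts[1].strip())
    matches = []
    for line in candidate_lines[1:]:  # skip the header row
        gene = line.strip()
        if gene in gene_function:
            matches.append((gene, gene_function[gene]))
    return matches


# Illustrative data; in practice read the lines from file1.csv and file2.csv.
candidate_lines = ["Gene", "AQP7", "CDC2"]
database_lines = [
    "Gene   function ",
    "PDCD6  Programmed cell death protein 6 ",
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a ",
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a ",
]

matches = annotate_genes(candidate_lines, database_lines)

# Written to an in-memory buffer here; open('file3.txt', 'w', newline='')
# could be used instead to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Gene", "function"])
writer.writerows(matches)
print(buffer.getvalue())
```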
MervS answered Jun 28 '26 18:06


