How to get a molecule's name from its SMILES string using Python

Question

I have a number of molecules in SMILES format, and I want to get the name of each molecule from its SMILES string using Python.

For example:

CN1CCC[C@H]1c2cccnc2 - Nicotine  
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2 - Thiamin

Which Python module will help me do such conversions?
Kindly let me know.

Tim · Accepted Answer

I don't know of any single module that will let you do this; I had to play data wrangler to try to get a satisfactory answer.

I tackled this using Wikipedia, which is being used more and more as a source of structured bioinformatics / cheminformatics data. As it turned out, though, my program reveals that a lot of that data is incorrect.

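I used urllib to submit SPARQL queries to DBpedia, first searching for the SMILES string and, failing that, searching for the molecular weight of the compound. The molecular weight is computed from the SMILES string with pybel (the Open Babel Python bindings).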
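The program in this answer is written for Python 2 (urllib2, print statements). For readers on Python 3, the request step might look roughly like the sketch below. It is only a sketch: build_sparql_url and run_sparql are hypothetical names that do not appear in the original script, the timeout value is an arbitrary assumption, and the endpoint URL is the one used in this answer.

```python
import json
import urllib.parse
import urllib.request

DBPEDIA_ENDPOINT = 'http://dbpedia.org/sparql'  # endpoint used in this answer


def build_sparql_url(endpoint, sparql):
    """Return the GET URL that carries `sparql` as the 'query' parameter."""
    return endpoint + '?' + urllib.parse.urlencode({'query': sparql})


def run_sparql(endpoint, sparql, accept='application/json', timeout=30):
    """Send the query and return the parsed JSON response (needs network access)."""
    request = urllib.request.Request(
        build_sparql_url(endpoint, sparql),
        headers={'Accept': accept},  # ask the endpoint for JSON results
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

Keeping the URL construction in its own function means the encoding can be checked without touching the network; run_sparql(DBPEDIA_ENDPOINT, some_query) would then perform the actual request.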

import sys
import urllib
import urllib2
import traceback
import pybel
import json

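def query(q,epr,f='application/json'):
    """Send SPARQL query q to endpoint epr via HTTP GET and return the raw response body."""
    try:
        # URL-encode the query text as the 'query' parameter
        params = urllib.urlencode({'query': q})
        opener = urllib2.build_opener(urllib2.HTTPHandler)
        request = urllib2.Request(epr+'?'+params)
        # Ask the endpoint for the requested result format (JSON by default)
        request.add_header('Accept', f)
        request.get_method = lambda: 'GET'
        url = opener.open(request)
        return url.read()
    except Exception:
        # Print the traceback, then re-raise the original exception unchanged
        traceback.print_exc(file=sys.stdout)
        raise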

url = 'http://dbpedia.org/sparql'

q1 = '''
select ?name where {
    ?s <http://dbpedia.org/property/smiles> "%s"@en.
    ?s rdfs:label ?name.
    FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "en"))
}
limit 10
'''
q2 = '''
select ?name where {
    ?s <http://dbpedia.org/property/molecularWeight> '%s'^^xsd:double.
    ?s rdfs:label ?name.
    FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "en"))
}
limit 10
'''

smiles = filter(None, '''

CN1CCC[C@H]1c2cccnc2
CN(CCC1)[C@@H]1C2=CC=CN=C2

OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2

Cc1nnc2CN=C(c3ccccc3)c4cc(Cl)ccc4-n12

CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13

CCc1nn(C)c2c(=O)[nH]c(nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4

CC(C)(N)Cc1ccccc1

CN(C)C(=O)Cc1c(nc2ccc(C)cn12)c3ccc(C)cc3

COc1ccc2[nH]c(nc2c1)S(=O)Cc3ncc(C)c(OC)c3C

CCN(CC)C(=O)[C@H]1CN(C)[C@@H]2Cc3c[nH]c4cccc(C2=C1)c34
'''.splitlines())

OBMolecules = {}
for smile in smiles:
    try:
        OBMolecules[smile] = pybel.readstring('smi', smile)
    except Exception as e:
        print e

for smi in smiles:
    print '--------------'
    print smi
    try:
        print "searching by smiles string.."
        results = json.loads(query(q1 % smi, url))
        if len(results['results']['bindings']) == 0:
            raise Exception('no results from smiles')
        else:
            print 'NAME: ', results['results']['bindings'][0]['name']['value']

    except Exception as e:
        print e

        try:
            mol_weight = round(OBMolecules[smi].molwt, 2)
            print "search ing by molecular weight %s" % mol_weight
            results = json.loads(query(q2 % mol_weight, url))
            if len(results['results']['bindings']) == 0:
                raise Exception('no results from molecular weight')
            else:
                print 'NAME: ', results['results']['bindings'][0]['name']['value']
        except Exception as e:
            print e

Output:

--------------
CN1CCC[C@H]1c2cccnc2
searching by smiles string..
no results from smiles
search ing by molecular weight 162.23
NAME:  Anabasine
--------------
CN(CCC1)[C@@H]1C2=CC=CN=C2
searching by smiles string..
no results from smiles
search ing by molecular weight 162.23
NAME:  Anabasine
--------------
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
searching by smiles string..
no results from smiles
search ing by molecular weight 267.37
NAME:  Pipradrol
--------------
Cc1nnc2CN=C(c3ccccc3)c4cc(Cl)ccc4-n12
searching by smiles string..
no results from smiles
search ing by molecular weight 308.76
no results from molecular weight
--------------
CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
searching by smiles string..
no results from smiles
search ing by molecular weight 284.74
NAME:  Mazindol
--------------
CCc1nn(C)c2c(=O)[nH]c(nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4
searching by smiles string..
no results from smiles
search ing by molecular weight 460.55
no results from molecular weight
--------------
CC(C)(N)Cc1ccccc1
searching by smiles string..
no results from smiles
search ing by molecular weight 149.23
NAME:  Phenpromethamine
--------------
CN(C)C(=O)Cc1c(nc2ccc(C)cn12)c3ccc(C)cc3
searching by smiles string..
no results from smiles
search ing by molecular weight 307.39
NAME:  Talastine
--------------
COc1ccc2[nH]c(nc2c1)S(=O)Cc3ncc(C)c(OC)c3C
searching by smiles string..
no results from smiles
search ing by molecular weight 345.42
no results from molecular weight
--------------
CCN(CC)C(=O)[C@H]1CN(C)[C@@H]2Cc3c[nH]c4cccc(C2=C1)c34
searching by smiles string..
no results from smiles
search ing by molecular weight 323.43
NAME:  Lysergic acid diethylamide

As you can see, the first two results, which should both be nicotine, come out wrong. This is because the Wikipedia entry for nicotine reports the molecular mass in the molecular weight field.
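Note also that the script prints only the first binding (bindings[0]) even though both queries allow up to 10 rows. Different compounds can share the same molecular weight once it is rounded to two decimals (isomers, for instance, have identical weights), so a weight match is better treated as a list of candidates than as an answer. A minimal sketch of collecting every returned name, assuming the same JSON layout the script already reads; candidate_names is a hypothetical helper and the sample dictionary is made up for illustration, not real DBpedia output:

```python
def candidate_names(results):
    """Return every distinct name in a SPARQL JSON result, in order of appearance."""
    names = []
    for binding in results.get('results', {}).get('bindings', []):
        value = binding.get('name', {}).get('value')
        if value is not None and value not in names:
            names.append(value)
    return names


# Hand-written stand-in for a DBpedia response, not real query output
sample = {'results': {'bindings': [
    {'name': {'type': 'literal', 'value': 'Compound A'}},
    {'name': {'type': 'literal', 'value': 'Compound B'}},
    {'name': {'type': 'literal', 'value': 'Compound A'}},
]}}

print(candidate_names(sample))
```

Printing all candidates for a given weight makes it obvious when the lookup is ambiguous, instead of silently reporting whichever row the endpoint happened to return first.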

Tags:

python

bioinformatics

sam
