I have a number of molecules in SMILES format, and I want to get each molecule's name from its SMILES string using Python.
For example:
CN1CCC[C@H]1c2cccnc2 - Nicotine
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2 - Thiamin
Which Python module will help me do such conversions?
Kindly let me know.
I don't know of any single module that will let you do this; I had to play data wrangler to get a satisfactory answer.
I tackled this using Wikipedia, which is increasingly used for structured bioinformatics/chemoinformatics data, but as my program reveals, a lot of that data is incorrect.
I used urllib to submit a SPARQL query to DBpedia, first searching for the SMILES string and, failing that, searching for the molecular weight of the compound.
import sys
import urllib
import urllib2
import traceback
import pybel
import json

def query(q, epr, f='application/json'):
    # Send SPARQL query q to endpoint epr via HTTP GET and return the raw response body.
    try:
        params = {'query': q}
        params = urllib.urlencode(params)
        opener = urllib2.build_opener(urllib2.HTTPHandler)
        request = urllib2.Request(epr + '?' + params)
        request.add_header('Accept', f)
        request.get_method = lambda: 'GET'
        url = opener.open(request)
        return url.read()
    except Exception as e:
        # Print the traceback, then re-raise so the caller can handle the failure.
        traceback.print_exc(file=sys.stdout)
        raise e

url = 'http://dbpedia.org/sparql'
q1 = '''
select ?name where {
?s <http://dbpedia.org/property/smiles> "%s"@en.
?s rdfs:label ?name.
FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "en"))
}
limit 10
'''
q2 = '''
select ?name where {
?s <http://dbpedia.org/property/molecularWeight> '%s'^^xsd:double.
?s rdfs:label ?name.
FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "en"))
}
limit 10
'''
smiles = filter(None, '''
CN1CCC[C@H]1c2cccnc2
CN(CCC1)[C@@H]1C2=CC=CN=C2
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
Cc1nnc2CN=C(c3ccccc3)c4cc(Cl)ccc4-n12
CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
CCc1nn(C)c2c(=O)[nH]c(nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4
CC(C)(N)Cc1ccccc1
CN(C)C(=O)Cc1c(nc2ccc(C)cn12)c3ccc(C)cc3
COc1ccc2[nH]c(nc2c1)S(=O)Cc3ncc(C)c(OC)c3C
CCN(CC)C(=O)[C@H]1CN(C)[C@@H]2Cc3c[nH]c4cccc(C2=C1)c34
'''.splitlines())

# Parse each SMILES string with Open Babel so its molecular weight is available later.
OBMolecules = {}
for smile in smiles:
    try:
        OBMolecules[smile] = pybel.readstring('smi', smile)
    except Exception as e:
        print e

for smi in smiles:
    print '--------------'
    print smi
    try:
        # First attempt: exact match on the SMILES string.
        print "searching by smiles string.."
        results = json.loads(query(q1 % smi, url))
        if len(results['results']['bindings']) == 0:
            raise Exception('no results from smiles')
        else:
            print 'NAME: ', results['results']['bindings'][0]['name']['value']
    except Exception as e:
        print e
        try:
            # Fallback: match on molecular weight rounded to two decimals.
            mol_weight = round(OBMolecules[smi].molwt, 2)
            print "search ing by molecular weight %s" % mol_weight
            results = json.loads(query(q2 % mol_weight, url))
            if len(results['results']['bindings']) == 0:
                raise Exception('no results from molecular weight')
            else:
                print 'NAME: ', results['results']['bindings'][0]['name']['value']
        except Exception as e:
            print e

Output:
--------------
CN1CCC[C@H]1c2cccnc2
searching by smiles string..
no results from smiles
search ing by molecular weight 162.23
NAME: Anabasine
--------------
CN(CCC1)[C@@H]1C2=CC=CN=C2
searching by smiles string..
no results from smiles
search ing by molecular weight 162.23
NAME: Anabasine
--------------
OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
searching by smiles string..
no results from smiles
search ing by molecular weight 267.37
NAME: Pipradrol
--------------
Cc1nnc2CN=C(c3ccccc3)c4cc(Cl)ccc4-n12
searching by smiles string..
no results from smiles
search ing by molecular weight 308.76
no results from molecular weight
--------------
CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
searching by smiles string..
no results from smiles
search ing by molecular weight 284.74
NAME: Mazindol
--------------
CCc1nn(C)c2c(=O)[nH]c(nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4
searching by smiles string..
no results from smiles
search ing by molecular weight 460.55
no results from molecular weight
--------------
CC(C)(N)Cc1ccccc1
searching by smiles string..
no results from smiles
search ing by molecular weight 149.23
NAME: Phenpromethamine
--------------
CN(C)C(=O)Cc1c(nc2ccc(C)cn12)c3ccc(C)cc3
searching by smiles string..
no results from smiles
search ing by molecular weight 307.39
NAME: Talastine
--------------
COc1ccc2[nH]c(nc2c1)S(=O)Cc3ncc(C)c(OC)c3C
searching by smiles string..
no results from smiles
search ing by molecular weight 345.42
no results from molecular weight
--------------
CCN(CC)C(=O)[C@H]1CN(C)[C@@H]2Cc3c[nH]c4cccc(C2=C1)c34
searching by smiles string..
no results from smiles
search ing by molecular weight 323.43
NAME: Lysergic acid diethylamide
As you can see, the first two results, which should be nicotine, come out wrong. This is because the Wikipedia entry for nicotine reports the molecular mass in the molecular weight field.
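
The script above is written for Python 2 (urllib2, print statements). For anyone on Python 3, here is a rough sketch of the same GET request using only the standard library. It assumes the DBpedia endpoint still returns the standard SPARQL JSON results layout that the original script reads; the names build_url, run_query and names_from_response are my own, not from any library, and I have not re-run the queries against the live endpoint.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = 'http://dbpedia.org/sparql'  # same endpoint as the script above


def build_url(sparql, endpoint=ENDPOINT):
    """Return the GET URL for a SPARQL query, with the query URL-encoded."""
    return endpoint + '?' + urllib.parse.urlencode({'query': sparql})


def run_query(sparql, endpoint=ENDPOINT, accept='application/json'):
    """Send the query and return the raw response body as text."""
    request = urllib.request.Request(
        build_url(sparql, endpoint), headers={'Accept': accept})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')


def names_from_response(response_text):
    """Extract every ?name value from a SPARQL JSON results document."""
    bindings = json.loads(response_text)['results']['bindings']
    return [binding['name']['value'] for binding in bindings]
```

With these, the lookup becomes names_from_response(run_query(q1 % smi)); an empty list means no match, at which point you can fall back to the molecular weight query as before.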