
RDKit: how to check molecules for exact match?

I'm using RDKit and trying to check whether two molecules match exactly. After parsing both with Chem.MolFromSmiles(), the expression m == p doesn't give the desired result. Of course, I can check whether p is a substructure of m and whether m is a substructure of p, but that looks too complicated to me. I couldn't find a code example for an exact match in the RDKit documentation, or I overlooked it. How do I do this correctly? Thank you for any hints.

Code:

from rdkit import Chem

myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'          # Carbazole
myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'      # Carbazole

m = Chem.MolFromSmiles(myMolecule)
p = Chem.MolFromSmiles(myPattern)

print(m == p)                    # returns False, first (unsuccessful) attempt to check for identity

print(m.HasSubstructMatch(p))    # returns True
print(p.HasSubstructMatch(m))    # returns True
print(m.HasSubstructMatch(p) and p.HasSubstructMatch(m))    # returns True, so are the molecules identical?
asked Feb 13 '20 at 15:02 by theozh



2 Answers

To check whether two different SMILES strings represent the same molecule, you can canonicalize both and compare the results.

from rdkit import Chem

myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'
myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'

a = Chem.CanonSmiles(myPattern)
b = Chem.CanonSmiles(myMolecule)

print(a)         # c1ccc2c(c1)[nH]c1ccccc12
print(b)         # c1ccc2c(c1)[nH]c1ccccc12
print(a == b)    # True
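If you already hold Mol objects, as in the question, the same idea can be applied to them directly. The following is a minimal sketch, assuming a hypothetical helper name same_molecule (it is not part of RDKit). It relies on Chem.MolToSmiles returning a canonical SMILES by default:

```python
from rdkit import Chem


def same_molecule(mol_a, mol_b):
    """Return True if both Mol objects yield the same canonical SMILES.

    same_molecule is a hypothetical helper name, not an RDKit function.
    """
    if mol_a is None or mol_b is None:
        # Chem.MolFromSmiles returns None when parsing fails
        raise ValueError("could not parse one of the inputs")
    return Chem.MolToSmiles(mol_a) == Chem.MolToSmiles(mol_b)


m = Chem.MolFromSmiles('C1=CC=C2C(=C1)C3=CC=CC=C3N2')   # Carbazole, Kekule form
p = Chem.MolFromSmiles('c1ccc2c(c1)c3ccccc3[nH]2')      # Carbazole, aromatic form
print(same_molecule(m, p))
```

In recent RDKit releases the canonical SMILES includes stereochemistry by default, so two stereoisomers written with explicit stereo marks would compare as different. Whether that counts as an "exact match" depends on what you need.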
answered Sep 28 '22 at 04:09 by rapelpy


My RDKit knowledge isn't great, and their documentation is famously terrible, but I have done this kind of thing myself. A (perhaps over-engineered) method would be to generate a graph with networkx and compare the nodes and edges.

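This is surprisingly simple: use RDKit to read the file or SMILES string, then generate the topology on the fly. Note that the graph built here records connectivity only, not elements or bond orders, so two molecules with the same skeleton but different atoms would also compare as isomorphic. If you generate an RDKit Mol object from a SMILES string as you have above, you would then do: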

import networkx as nx


def topology_from_rdkit(rdkit_molecule):

    topology = nx.Graph()
    for atom in rdkit_molecule.GetAtoms():
        # Add the atoms as nodes
        topology.add_node(atom.GetIdx())

        # Add the bonds as edges
        for bonded in atom.GetNeighbors():
            topology.add_edge(atom.GetIdx(), bonded.GetIdx())

    return topology


def is_isomorphic(topology1, topology2):
    return nx.is_isomorphic(topology1, topology2)
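Because the topology above stores only atom indices, ethanol (CCO) and propane (CCC) would be reported as isomorphic. One possible refinement, sketched here under the assumption that element symbols and bond orders are enough to tell your molecules apart, is to store them as node and edge attributes and pass matching functions to nx.is_isomorphic. The names labelled_graph and same_labelled_graph are hypothetical helpers, not RDKit or networkx functions:

```python
import networkx as nx
from rdkit import Chem


def labelled_graph(mol):
    """Build a graph whose nodes carry element symbols and edges carry bond orders."""
    graph = nx.Graph()
    for atom in mol.GetAtoms():
        graph.add_node(atom.GetIdx(), element=atom.GetSymbol())
    for bond in mol.GetBonds():
        # GetBondTypeAsDouble gives 1.0, 2.0, 3.0, or 1.5 for aromatic bonds
        graph.add_edge(
            bond.GetBeginAtomIdx(),
            bond.GetEndAtomIdx(),
            order=bond.GetBondTypeAsDouble(),
        )
    return graph


def same_labelled_graph(mol_a, mol_b):
    """Isomorphism test that also requires matching elements and bond orders."""
    return nx.is_isomorphic(
        labelled_graph(mol_a),
        labelled_graph(mol_b),
        node_match=lambda a, b: a["element"] == b["element"],
        edge_match=lambda a, b: a["order"] == b["order"],
    )


ethanol = Chem.MolFromSmiles('CCO')
propane = Chem.MolFromSmiles('CCC')
print(same_labelled_graph(ethanol, propane))
```

This sketch still ignores implicit hydrogens, formal charges, isotopes and stereochemistry, so it is a weaker test than comparing canonical SMILES.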
answered Sep 28 '22 at 06:09 by QuantumChris