Artificial Intelligence Techniques in Drug Discovery Part 2: Architectures and Training

In our previous blog, AI techniques in drug discovery, part 1, we discussed the history of applying AI in drug discovery and considered how molecular data has to be prepared for use with these methods. In Part 2, we focus on the practical aspects of selecting the right type of AI tool for a specific problem, as well as the different approaches to training AI models.

Which model architecture should I use? 

In essence, machine learning is a way of generating problem-solving frameworks that improve by learning. When building such a system, we need to consider two things: its functional form and how we are going to train it. Several different models have been used, and continue to be used, such as Random Forests (RFs), K-nearest neighbors (KNN), and naive Bayes (NB). In this blog, we focus on Deep Neural Networks (DNNs), which have shown enormous promise in recent years.

From the early stages of the deep learning revolution, it became apparent that certain types of neural networks were better suited to particular types of questions. Thus, it is very important to select the right architecture for the problem at hand.

For instance, recurrent neural networks (RNNs) are ideally suited for processing sequential data. They have therefore been used to solve language modeling problems and to generate synthetic text and music. It is only natural to think that these architectures can tackle similar problems in the SMILES molecular language. Indeed, RNNs have successfully generated compounds that bind Dopamine Receptor Type 2 using SMILES.
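To make the sequence framing concrete, the sketch below shows how a SMILES string can be turned into the (context, next character) pairs that a character-level RNN would be trained on. It is a minimal illustration, not the pipeline used in the work mentioned above. The function name `next_char_pairs` and the start and end markers `^` and `$` are assumptions made for this example, and splitting by character is a simplification (two-letter atoms such as Cl would need a proper tokenizer).

```python
def next_char_pairs(smiles, start="^", end="$"):
    """Return (context, next_char) training pairs for one SMILES string.

    The start/end markers are hypothetical conventions for this sketch:
    they let a model learn where a molecule begins and when to stop.
    """
    seq = start + smiles + end
    # For every position, the model sees the prefix and predicts the next symbol.
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


pairs = next_char_pairs("CCO")  # "CCO" is the SMILES string for ethanol
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

Once trained on many such pairs, a model can generate new strings one symbol at a time by repeatedly predicting the next character from what it has produced so far.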

In the last few years, deep learning generative models have garnered more and more interest. These are models capable of producing highly realistic synthetic content such as images, texts, and even molecules. Two approaches that have gathered significant attention are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

VAEs are composed of two parts: an encoder and a decoder. The encoder compresses information, and then the decoder tries to recover the original. The system is trained to lose as little information as possible during the encoding-decoding process. This way, the system learns how to keep just the most essential information hidden in the data. Tools applying VAEs to SMILES for automatic molecule generation have been deployed with some degree of success. Indeed, a VAE lies at the core of the engine used by Insilico Medicine for the fast development of a DDR1 kinase inhibitor.
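The training objective can be sketched in a few lines. The paragraph above describes the reconstruction half of the objective; a VAE additionally pushes the encoded representation toward a standard normal distribution through a KL-divergence term. The code below is a minimal sketch of that loss in plain Python, assuming a Gaussian latent code and a squared-error reconstruction term. The function names and the `beta` weight are illustrative choices, not taken from any of the tools mentioned above.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Reconstruction term: how much information was lost in encode/decode."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus a (beta-weighted) KL regularizer."""
    return squared_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


# Hypothetical encoder outputs for one input; in a real model these come
# from neural networks rather than hand-written numbers.
rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, -1.0]
z = reparameterize(mu, log_var, rng)   # latent sample passed to the decoder
x, x_hat = [1.0, 0.0, 1.0], [0.9, 0.1, 0.8]
print(vae_loss(x, x_hat, mu, log_var))
```

Sampling `z` around `mu` rather than using `mu` directly is what lets a trained decoder generate new molecules from random latent points.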

GANs have two competing agents, or adversaries: a generator and a discriminator. The generator creates ‘fake’ data points, and the discriminator has to spot them. Training is essentially a continuous competition: each adversary gets better at its job by trying to beat the other, and as a consequence the generated data becomes ever more realistic, somewhat like trying to win a drawing competition.
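The competition can be written down as two loss functions, one per adversary. The sketch below assumes the discriminator outputs a probability that a sample is real, and uses the common binary cross-entropy formulation with the non-saturating generator loss. The function names and the example scores are hypothetical, chosen only to show how the two objectives pull in opposite directions.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score 1, fakes should score 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: fakes should be scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator scores (probability that a sample is real).
sharp_discriminator = {"real": [0.9, 0.8], "fake": [0.1, 0.2]}
fooled_discriminator = {"real": [0.5, 0.5], "fake": [0.5, 0.5]}

for name, scores in [("sharp", sharp_discriminator), ("fooled", fooled_discriminator)]:
    print(name,
          "D loss:", discriminator_loss(scores["real"], scores["fake"]),
          "G loss:", generator_loss(scores["fake"]))
```

When the discriminator is sharp, its loss is low and the generator's is high; as the generator improves, the balance shifts the other way.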

GANs have primarily been used to create synthetic images and videos, including the infamous deepfakes. But GANs can also be applied to SMILES to generate molecules. A prominent example is ORGANIC (objective-reinforced generative adversarial network for inverse-design chemistry), which uses SMILES to generate distributions of molecules that optimize for particular properties such as solubility or synthesizability. MolGAN, by contrast, uses GANs to generate molecular graphs in a computationally efficient manner.


DNN, or Deep Neural Network, is a type of artificial neural network that contains multiple layers between the input and output layers. These deep layers are especially advantageous for the analysis of complex data, as a deep network can be more efficient, requiring fewer units than a similarly performing shallow network.

GAN, or Generative Adversarial Network, is a type of artificial neural network architecture. In this approach, two neural networks compete with each other in a zero-sum game, where one agent’s gain is the other agent’s loss.

Reinforcement learning is a machine learning training method that uses a scoring system to reward desired behaviors and punish undesired ones by lowering the score. This means that a model trained with reinforcement learning should be able to perceive and interpret its environment, take actions, and learn through trial and error.

RFs, or Random Forests, are an ensemble learning method used mainly for classification and regression tasks. Random Forests operate by constructing a multitude of decision trees at training time. For classification, the output is the class indicated by the largest number of trees; for regression, it is the average of the results returned by the trees.

RNN, or Recurrent Neural Network, is a class of artificial neural networks in which connections between nodes form a directed graph along a temporal sequence. This property allows an RNN to exhibit temporal dynamic behavior.

SMILES, or the simplified molecular-input line-entry system, is a notation used to describe chemical structures using short ASCII strings. The SMILES procedure is also reversible: most molecular editors can import a SMILES string and convert it back into a two-dimensional drawing or three-dimensional model of the molecule.

Supervised learning is a machine learning framework that maps an input variable to a corresponding output variable using labeled examples. This results in a function inferred from the data.

Unsupervised learning is a machine learning framework that analyzes unlabeled data without an output variable. Thus, unsupervised learning algorithms discover, on their own, patterns or distinctive characteristics that explain the data set.

VAE, or Variational Autoencoder, is a type of artificial neural network architecture. A VAE is a probabilistic graphical model trained using variational Bayesian methods.

DMTA, or Design-Make-Test-Analysis, is the central process of drug discovery. This process enables hypothesis construction and testing, conducting experiments, and analyzing the associated data for new findings and information. Each step of the process relies on the inputs and outputs of the other three.

In recent years, a new architecture has taken ML by storm. In 2017, the groundbreaking paper titled ‘Attention Is All You Need’ introduced the transformer architecture, which exploits a mechanism known as self-attention. Transformers were originally invented for machine translation and are the engine that drives powerful language models like BERT and GPT-3. But transformers proved to have much broader applications. They were a fundamental part of what is likely to be one of the most significant scientific discoveries of the decade: the solution of the protein folding problem by DeepMind’s AlphaFold 2. Several groups are proposing different approaches to deploying transformers for drug discovery. ChemBERTa and SMILES-BERT use BERT transformers and SMILES to create molecular property prediction engines. Meanwhile, GROVER deploys transformers on graph representations and has shown excellent performance.
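Self-attention itself is compact enough to sketch directly. In the version below, every token scores its similarity to every other token, the scores are normalized with a softmax, and each token's output is the resulting weighted average of all token vectors. This is a minimal plain-Python sketch of scaled dot-product attention. As a simplifying assumption, the same vectors are used as queries, keys, and values, whereas a real transformer first passes the inputs through learned projections. The toy token vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights


# Three toy "token" vectors (hypothetical embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every other token, which is how the model relates distant parts of a sentence, a SMILES string, or a protein sequence.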

How to train your model 

Now comes one of the most challenging steps in machine learning: the learning itself. This part of the workflow is powered by data. Ideally, you would like to feed your model with well-curated and suitably labeled data, but such data is hard to come by. The deployment of automated laboratories that can perform and track hundreds of experiments per day is helping alleviate this problem. Moreover, new learning paradigms that make more efficient use of data, such as self-supervised learning, active learning, and reinforcement learning, are becoming mainstream.

Self-supervised learning compensates for missing labels by generating training signals from the data itself, in effect ‘augmenting’ the dataset. The training used for BERT is the canonical example: words in a sentence are masked, and the model is trained to fill in the blanks. A similar approach can be used with molecules. In particular, self-supervised learning has been used to augment molecular graph datasets by performing atom masking, bond deletion, and subgraph removal. An interesting example is MolGNN, which has been deployed for self-labeling graph motifs (recurrent structures) with great success.
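The masking idea can be illustrated on a SMILES string. The sketch below hides a fraction of the tokens and records the originals, which become the labels the model must predict; no manual annotation is needed. It is a minimal illustration rather than the procedure of any specific tool. The function name `mask_tokens`, the `[MASK]` placeholder, the 15% rate, and the character-level tokens are all assumptions made for this example.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens and return (masked_tokens, targets).

    `targets` maps each masked position to the original token, i.e. the
    label the model is trained to recover.
    """
    rng = random.Random(seed)
    # Mask at least one token, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


tokens = list("CC(=O)O")  # acetic acid, split by character for simplicity
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

The same pattern carries over to molecular graphs, where the hidden items are atoms, bonds, or subgraphs rather than characters.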

Reinforcement learning has demonstrated immense potential for molecule generation. Reinforcement learning deals with the question of how an agent should act so as to maximize rewards. Reinforcement learning has been critical for the development of astonishing game-playing algorithms like AlphaGo and AlphaZero, which can learn by playing against themselves. Reinforcement learning can be used to mimic the Design-Make-Test-Analysis (DMTA) cycles underlying drug design. In this case, the agent generates a molecule which is then tested and improved iteratively. 
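A toy version of this loop is sketched below. The agent keeps a preference for each candidate molecule, proposes one (design), receives a score for it (make and test), and shifts its preferences toward candidates that scored better than average (analyze). This is a minimal sketch using a simple policy-gradient update, not the method of any system named in this blog. The candidate list and the `toy_reward` function, which just counts oxygen atoms, are hypothetical stand-ins for a real generator and a real assay or property predictor.

```python
import math
import random


def softmax(prefs):
    """Convert preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_design_loop(candidates, reward_fn, steps=2000, lr=0.1, seed=0):
    """Iteratively propose candidates and reinforce the ones that score well."""
    rng = random.Random(seed)
    prefs = [0.0] * len(candidates)
    baseline = 0.0  # running mean reward, used to judge "better than average"
    for step in range(1, steps + 1):
        probs = softmax(prefs)
        idx = rng.choices(range(len(candidates)), weights=probs)[0]  # design
        reward = reward_fn(candidates[idx])                          # make + test
        baseline += (reward - baseline) / step
        # Analyze: raise the preference of the chosen candidate if it beat
        # the baseline, lower it otherwise (policy-gradient update).
        for j in range(len(prefs)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            prefs[j] += lr * (reward - baseline) * grad
    return softmax(prefs)


def toy_reward(smiles):
    """Hypothetical score: number of oxygen atoms in the SMILES string."""
    return float(smiles.count("O"))


candidates = ["CC", "CCO", "OCCO"]
final_probs = run_design_loop(candidates, toy_reward)
for smiles, p in zip(candidates, final_probs):
    print(smiles, round(p, 3))
```

In a real setting, the fixed candidate list would be replaced by a generative model and the toy score by experimental or predicted properties, but the propose, score, and update cycle is the same.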

Reinforcement learning has also been used for designing SMILES-based chemical libraries with predetermined properties. A good example is ReLeaSE (Reinforcement Learning for Structural Evolution), which has been used to develop libraries biased toward specific physical properties, such as melting point or hydrophobicity, or toward compounds with inhibitory activity against Janus protein kinase 2. Moreover, the aforementioned Insilico Medicine algorithm for identifying DDR1 kinase inhibitors was trained using reinforcement learning.


AI can augment and accelerate the drug discovery process. This group of techniques has already proven its value in this context, from finding new drug targets and designing drug-like molecules to running testing and validation in the virtual lab. However, there are still challenges ahead. Most importantly, AI models require good-quality data to make accurate predictions, and currently a large portion of the data in drug discovery is collected manually, resulting in sparse data and human errors. What’s more, to be truly useful, AI has to be explainable. Only then will AI be able to go beyond generating predictions and start generating scientific understanding of the underlying biological or chemical mechanisms of disease.

If you enjoyed this blog, do check out the other blogs in our series on the most commonly asked questions about global trends in AI for drug discovery. Watch this space for our next blog, where we will cover the most promising start-ups, Big Tech, and Pharma companies within the AI-enabled drug discovery industry.