Artificial Intelligence Techniques in Drug Discovery Part 1: Introduction and Chemical Representations

We are continuing our series of Top questions in Artificial Intelligence (AI) drug discovery (read the first in the series here). This blog focuses on the history of AI in drug discovery, the most popular AI techniques in use, and approaches for translating chemical notation into machine-readable data.

Image courtesy Pexels

Exploring chemical space in search of drugs

Drug discovery campaigns often follow a general template. First, a hypothesis needs to be formulated regarding whether the inhibition or activation of a target—enzymes, receptors, ion channels, etc.—plays a role in a particular disease. Once a target has been identified and validated, scientists explore chemical space in search of molecules with the desired effect on the target. Their findings are then refined until a lead candidate ready for preclinical and clinical trials emerges. 

There are on the order of 10⁶⁰ drug-like compounds. And while we can rely on previous experience, intuition, and serendipity to narrow the search, the task of navigating that chemical space remains daunting. Pharma and biotech companies spend €570m to develop a successful lead candidate. The process takes 3.5 years on average, and each lead candidate has just a 12% chance of making it from first-in-human trials to the market.

Scientists are increasingly relying on computers to support their exploration of chemical space. As AI tools become more powerful, there is hope that augmenting human capabilities with AI will make this exploration more efficient and help bring drugs to patients faster.

A brief history of AI in drug discovery

The potential of computational tools to explore and document chemical space has long been apparent. Virtual Screening techniques, today used to screen large chemical databases like PubChem and ZINC, have been in use since the 1970s. Machine learning (ML) methods began to enter the field in the early 2000s. ML approaches excel at pattern recognition in large complex datasets and have thus attracted enormous interest from the drug discovery community.

ML methods known as Random Forests (RF) have been used for almost two decades to support Virtual Screening efforts and predict molecular structure and function quantitatively. Then, in 2012, the Deep Neural Network (DNN) AlexNet won the world’s foremost visual recognition challenge (ImageNet) by a wide margin over all its competitors, heralding the age of Deep Learning. Later that year, a team of ML graduate students entered the Merck Molecular Activity Challenge with a DNN and, for the first time, outperformed RF approaches in predicting molecular activities.


Fingerprints — Molecular fingerprints are a way to represent molecules as mathematical objects. In the most basic realization, each atom in the molecule will be assigned an identifier. Then, each atom identifier is adjusted based on its neighbors. This procedure is followed by removing redundancies and creating a list of identifiers that can be used for further analysis, such as machine learning or AI tools. 

QSAR — or Quantitative structure-activity relationship, is an essential strategy for quantitative chemistry and pharmacy. It is based on the idea that when the structure of a molecule changes, so does the activity. Thus, QSAR can optimize drug design by helping to streamline the search process for the most active compounds in a series. 

RFs — or Random Forests, are an ensemble learning method used mainly for classification and regression tasks. Random Forests operate by constructing a multitude of decision trees at training time. For classification tasks, the output is the class selected by most of the trees. For regression tasks, it is the average of the results returned by the trees.

SMILES — or the simplified molecular-input line-entry system, is a type of notation used to describe chemical structures using short ASCII strings. The SMILES procedure is also reversible; most molecular editors can import a SMILES string and convert it back into a two-dimensional drawing or three-dimensional model of the molecule.

Virtual Screening — is a computational technique used in drug discovery to search libraries of small molecules to identify those structures which are most likely to bind to a drug target.
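
The procedure described in the Fingerprints entry above can be sketched in a few lines of Python. This is a toy illustration, not the algorithm of any particular toolkit: the molecule is written by hand as a list of element symbols plus a list of bonded atom pairs, and the names `stable_hash`, `circular_fingerprint`, and `radius` are made up for this example. Production fingerprints also take bond types, charges, and ring membership into account.

```python
import zlib


def stable_hash(text: str) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() of str is salted per run)."""
    return zlib.crc32(text.encode("utf-8"))


def circular_fingerprint(elements, bonds, radius=2):
    """Return a sorted list of unique atom-environment identifiers.

    elements: element symbols, one per heavy atom (hypothetical input format)
    bonds:    (i, j) index pairs for bonded atoms
    radius:   number of neighbor-update rounds
    """
    neighbors = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Step 1: assign each atom an initial identifier (here: from its element only).
    ids = [stable_hash(symbol) for symbol in elements]
    collected = set(ids)

    # Step 2: repeatedly adjust each identifier based on its neighbors' identifiers.
    for _ in range(radius):
        updated = []
        for i, own in enumerate(ids):
            around = sorted(ids[j] for j in neighbors[i])
            updated.append(stable_hash(f"{own}|{around}"))
        ids = updated
        collected.update(ids)

    # Step 3: the set has removed duplicates; sorting gives a reproducible list.
    return sorted(collected)


# Ethanol (C-C-O) written with two atom orderings, and dimethyl ether (C-O-C).
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
dimethyl_ether = circular_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol == ethanol_reordered)  # same molecule, different atom numbering
print(ethanol == dimethyl_ether)     # same atoms, different connectivity
```

Because each update only looks at an atom's own identifier and the sorted identifiers of its neighbors, the result does not depend on how the atoms happen to be numbered.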

Some recent results highlight the potential of the technology. A team at MIT deployed ML to tackle one of the world’s most pressing challenges—the emergence of antimicrobial-resistant strains—and developed a deep learning approach to antibiotic discovery. In another feat, researchers from Insilico Medicine used deep learning to identify potent DDR1 kinase inhibitors in just 21 days! 

The remarkable advances in ML, particularly in deep learning, have ushered in a new era of drug discovery. Today, more than 230 startups worldwide use ML to power drug discovery programs, and the field is attracting significant private funding. As a result, the total amount of VC funding for AI-biotech startups increased in 2020 by around 23%, compared to 2019, approaching a total of $1.9bn.

AI techniques in drug discovery—a bird’s eye view

There are several ways of classifying ML techniques in drug discovery, but we believe that the best way of navigating the maze is to ask four key questions. This part covers the first two:

What kind of application am I targeting? 

High-Throughput Screening (HTS) was deployed in the 1980s to accelerate the early phases of drug discovery with automation. One of its most prominent outcomes has been the generation of large structure-activity chemical libraries, several of which are publicly available (like PubChem and ChEMBL). These datasets can be thought of as the part of chemical space that we have already charted. 

These databases are large. For instance, PubChem contains over 100 million unique chemical structures with 271 million activity data points from 1.2 million biological assay experiments. Thus, computational approaches like Virtual Screening are required to parse through the data in search of promising compounds.

In essence, these tools have to solve a multi-objective optimization problem. The role of the screening algorithm is to parse through the large database and identify compounds that satisfy several criteria simultaneously. The molecule should have the desired effect on the target, bind selectively, and preferably be easy to synthesize, etc. Thus, the Virtual Screening engine contains a predictive model that maps the molecular structure to the value of the property in question. This framework is commonly referred to as quantitative structure-activity relationship (QSAR) modeling.
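
To make the "several criteria simultaneously" idea concrete, here is a minimal sketch of the filtering step. Everything in it is an assumption for illustration: the molecule names, the property names, the thresholds, and the predicted values are placeholders, not real data. In a real Virtual Screening engine, the numbers in `predicted` would come from trained QSAR models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    name: str
    predicted: Dict[str, float]  # property name -> model prediction (placeholder values)


# Hypothetical acceptance criteria: one test per predicted property.
CRITERIA: Dict[str, Callable[[float], bool]] = {
    "potency": lambda value: value >= 0.7,               # desired effect on the target
    "selectivity": lambda value: value >= 0.6,           # binds selectively
    "synthesis_difficulty": lambda value: value <= 0.5,  # easy enough to make
}


def passes_all(candidate: Candidate, criteria: Dict[str, Callable[[float], bool]]) -> bool:
    """A candidate is a hit only if every criterion is met at the same time."""
    return all(test(candidate.predicted[prop]) for prop, test in criteria.items())


def screen(library: List[Candidate], criteria: Dict[str, Callable[[float], bool]]) -> List[Candidate]:
    """Keep the candidates from the library that satisfy all criteria."""
    return [candidate for candidate in library if passes_all(candidate, criteria)]


# Made-up library with made-up predictions, for illustration only.
LIBRARY = [
    Candidate("mol_A", {"potency": 0.82, "selectivity": 0.71, "synthesis_difficulty": 0.30}),
    Candidate("mol_B", {"potency": 0.91, "selectivity": 0.40, "synthesis_difficulty": 0.20}),
    Candidate("mol_C", {"potency": 0.75, "selectivity": 0.88, "synthesis_difficulty": 0.65}),
    Candidate("mol_D", {"potency": 0.55, "selectivity": 0.90, "synthesis_difficulty": 0.10}),
]

hits = screen(LIBRARY, CRITERIA)
print([candidate.name for candidate in hits])
```

Hard thresholds are only one way to combine objectives. Weighted scores or Pareto ranking are common alternatives when no candidate passes every test.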

We can take an even bolder approach. Rather than sifting through existing chemical libraries, drug designers can use QSAR models to explore uncharted chemical space. This approach, often referred to as de novo drug design, can be viewed as an extension to Virtual Screening and combines the two main applications of AI techniques: molecular property prediction and molecule generation.

How are molecules presented to the machine? 

There are various ways of representing molecules to make them machine-readable. One of the most common approaches is to use fixed molecular descriptors. First, we start with those that can be obtained from the molecular formula: molecular weight, number of atoms, etc. Then we build a binary list indicating the presence or absence of particular functional groups (Figure 1). The resulting list is known as the molecular fingerprint. In principle, we can continue adding further information regarding atom connectivity and molecular topology to build more detailed fingerprints.
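
The two steps above can be sketched as follows. Treat this as a simplified illustration: the functional-group annotations are written by hand rather than detected from a structure, the list of keys in `GROUP_KEYS` is an arbitrary choice for this example, and the function names are our own. The formulas for vanillin and melatonin are the standard ones.

```python
import re

# Standard atomic masses (g/mol), rounded to three decimals.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def parse_formula(formula: str) -> dict:
    """Turn a formula such as 'C8H8O3' into {'C': 8, 'H': 8, 'O': 3}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def formula_descriptors(formula: str) -> dict:
    """Descriptors that can be read off the formula alone."""
    counts = parse_formula(formula)
    return {
        "molecular_weight": sum(ATOMIC_MASS[s] * n for s, n in counts.items()),
        "atom_count": sum(counts.values()),
    }


# Fixed, ordered list of keys: position k in the fingerprint always means the same group.
GROUP_KEYS = ["aromatic ring", "hydroxyl", "ether", "aldehyde", "amide"]


def key_fingerprint(groups_present: set) -> list:
    """Binary list: 1 if the group is present in the molecule, 0 otherwise."""
    return [1 if key in groups_present else 0 for key in GROUP_KEYS]


# Hand-annotated examples (assumption: groups are supplied, not detected).
MOLECULES = {
    "vanillin": ("C8H8O3", {"aromatic ring", "hydroxyl", "ether", "aldehyde"}),
    "melatonin": ("C13H16N2O2", {"aromatic ring", "ether", "amide"}),
}

for name, (formula, groups) in MOLECULES.items():
    print(name, formula_descriptors(formula), key_fingerprint(groups))
```

Note that the keys are fixed in advance by the person writing the code. Nothing in this representation is learned from data.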

These descriptors have been extremely useful in advancing computational drug discovery and are still widely used in industrial applications. However, they have a critical limitation: fingerprints are fixed and do not represent learnable features. With the advent of ML, the aim is to encode and decode molecules without relying on hard-coded rules, which has led researchers to alternative machine-readable representations of molecules. The two main approaches are Simplified Molecular Input Line Entry System (SMILES) strings and molecular graphs.

Figure 1. Deriving the SMILES representation of a chemical molecule. Here: ciprofloxacin, a fluoroquinolone antibiotic. Credit: original by Fdardel, slight edit by DMacks.

SMILES relies on a particular syntax for translating molecules into strings. For example, each atom is represented by its atomic symbol (lower case ‘c’ for aromatic carbon), and the different types of bonds are represented by dedicated characters: single (‘-’), double (‘=’), triple (‘#’), and aromatic (‘:’). See Figure 1 for an example. There are additional syntactic rules to encode aromaticity, branching, etc. SMILES representations have been used since the 1980s and have become an integral part of the computational chemist’s toolkit. Interest in SMILES has grown further in recent years thanks to remarkable advances of AI in language processing tasks.
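
Since language models consume SMILES as a sequence of tokens, a natural first step is to split a string into atoms, bonds, branches, and ring labels. The sketch below is a simplified tokenizer under stated assumptions: it covers the common organic-subset atoms, bracket atoms, the four bond symbols listed above, parentheses for branches, and ring-closure digits, and it ignores stereochemistry marks. The names `SMILES_TOKEN`, `tokenize`, and `classify` are our own.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"    # bracket atom, e.g. [NH4+]
    r"|Br|Cl"        # two-letter atoms, tried before single letters
    r"|[BCNOPSFI]"   # aliphatic atoms
    r"|[bcnops]"     # aromatic atoms (lower case)
    r"|[-=#:]"       # single, double, triple, aromatic bond
    r"|[()]"         # branch open / close
    r"|%\d{2}|\d"    # ring-closure labels
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")
    return tokens


def classify(token: str) -> str:
    """Label a token as 'atom', 'bond', 'branch', or 'ring'."""
    if token[0] == "[" or token[0].isalpha():
        return "atom"
    if token in "-=#:":
        return "bond"
    if token in "()":
        return "branch"
    return "ring"


vanillin = "COc1cc(C=O)ccc1O"
tokens = tokenize(vanillin)
print(tokens)
print(sum(classify(t) == "atom" for t in tokens), "heavy atoms")
```

Hydrogens are implicit in this notation, which is why only the eleven heavy atoms of vanillin appear as atom tokens.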

Figure 2. SMILES representations of melatonin and vanillin.

The SMILES approach is very useful, but it has some important limitations too. The strings do not directly encode atom connections and thus carry less structural information. Moreover, syntactically correct SMILES are not necessarily chemically valid. This is analogous to chatbots that might generate grammatically correct sentences which make no sense semantically. Think of Noam Chomsky’s famous ‘Colorless green ideas sleep furiously’.
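
The gap between syntax and chemistry is easy to demonstrate. The sketch below parses a deliberately restricted SMILES dialect (only C, N, O, F atoms, explicit or implied bonds, branches, and single-digit ring closures; no aromatic or bracket atoms) and checks whether any atom has more bonds than its usual valence allows. The function names and the restricted dialect are assumptions made for this example. Cheminformatics toolkits perform far more thorough checks.

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # usual valences of neutral atoms
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def bond_order_sums(smiles: str) -> list:
    """Return [(element, total bond order to heavy atoms)] for each atom."""
    atoms, used = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, bond order at opening)
    prev = None         # atom that the next atom will bond to
    pending = 1         # order of the next bond (single unless stated)
    for ch in smiles:
        if ch in MAX_VALENCE:
            atoms.append(ch)
            used.append(0)
            current = len(atoms) - 1
            if prev is not None:
                used[prev] += pending
                used[current] += pending
            prev, pending = current, 1
        elif ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            if not branch_stack:
                raise ValueError("unbalanced ')'")
            prev = branch_stack.pop()
        elif ch.isdigit() and prev is not None:
            if ch in open_rings:
                other, order = open_rings.pop(ch)
                order = max(order, pending)
                used[prev] += order
                used[other] += order
            else:
                open_rings[ch] = (prev, pending)
            pending = 1
        else:
            raise ValueError(f"unsupported character {ch!r}")
    if branch_stack or open_rings:
        raise ValueError("unclosed branch or ring")
    return list(zip(atoms, used))


def valence_violations(smiles: str) -> list:
    """List (atom index, element, bond order sum) for atoms exceeding their valence."""
    return [
        (index, element, total)
        for index, (element, total) in enumerate(bond_order_sums(smiles))
        if total > MAX_VALENCE[element]
    ]


print(valence_violations("CC(=O)O"))         # acetic acid
print(valence_violations("C(C)(C)(C)(C)C"))  # well-formed string, five bonds on one carbon
```

Both strings parse without complaint. Only the second one describes a carbon atom with five neighbors, which is the chemical equivalent of a grammatical but meaningless sentence.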

Molecular graphs provide a more intuitive representation. In this approach, atoms are mapped to nodes and bonds to edges. The attributes of each atom and bond are encoded in node and edge feature matrices, respectively. Thus, graphs naturally encode more structural information and are more easily interpretable. 
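
As a minimal sketch of this mapping, the snippet below builds the node and edge feature matrices for acetic acid by hand. The feature choices are assumptions for illustration: each node is described only by a one-hot encoding of its element, and each edge only by a one-hot encoding of its bond order. Real pipelines usually add many more attributes (charge, hybridization, ring membership, and so on) and derive the graph from a structure file or SMILES string rather than typing it in.

```python
ELEMENTS = ["C", "N", "O"]   # columns of the node feature matrix
BOND_ORDERS = [1, 2, 3]      # columns of the edge feature matrix


def one_hot(value, choices):
    """Encode value as a list with a single 1 at its position in choices."""
    return [1 if value == choice else 0 for choice in choices]


def build_graph(atoms, bonds):
    """Map atoms to nodes and bonds to edges.

    atoms: element symbols, one per heavy atom
    bonds: (i, j, order) triples for bonded atom pairs
    """
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    edge_index, edge_features = [], []
    for i, j, order in bonds:
        adjacency[i][j] = adjacency[j][i] = 1  # bonds are undirected
        edge_index.append((i, j))
        edge_features.append(one_hot(order, BOND_ORDERS))
    return {
        "node_features": [one_hot(atom, ELEMENTS) for atom in atoms],
        "adjacency": adjacency,
        "edge_index": edge_index,
        "edge_features": edge_features,
    }


# Acetic acid, CH3-C(=O)-OH, heavy atoms only: C0, C1, O2 (carbonyl), O3 (hydroxyl).
graph = build_graph(
    atoms=["C", "C", "O", "O"],
    bonds=[(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)

for key, value in graph.items():
    print(key, value)
```

Unlike a SMILES string, the adjacency matrix states directly which atoms are connected, which is what graph-based models take as input.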


AI can accelerate many aspects of drug discovery. But to understand the chemical world, AI tools need the right data to analyze. Methods for translating chemical and genomics data into machine-readable form are the foundation of any drug discovery pipeline. With the right data, AI can start to deliver insights and help discover the drugs of the future.

Did you enjoy this blog? Stay tuned for Part 2, where we will focus more on architectures and training approaches for AI in drug discovery.