With the advent of Artificial Intelligence in chemistry, traditional methods based on experiments and physical models are gradually being supplemented with data-driven machine learning paradigms. A growing number of data representations are being developed for computer processing and adapted to statistical models, many of which are generative.
Although engineering, finance and business will greatly benefit from the new algorithms, the advantages do not stem from algorithms alone. Large-scale computing has been an integral part of the physical sciences for decades, and some recent advances in Artificial Intelligence have begun to change the way scientific discoveries are made.
There is great enthusiasm for the outstanding achievements in the physical sciences, such as the use of machine learning to reconstruct images of black holes, or the contribution of AlphaFold, an AI programme developed by DeepMind (Alphabet/Google), to predicting the 3D structure of proteins.
One of the main goals of chemistry is to understand matter, its properties and the changes it can undergo. For example, when looking for new superconductors, vaccines or any other material with the properties we desire, we turn to chemistry.
We traditionally think of chemistry as being practised in laboratories with test tubes, Erlenmeyer flasks (graduated containers with a flat bottom, a conical body and a cylindrical neck) and gas burners. In recent years, however, it has also benefited from developments in computer science and quantum mechanics, both of which became important in the mid-20th century. Early applications included using computers to solve physics-based equations and to simulate chemical systems (albeit imperfectly) by combining theoretical chemistry with computer programming. That work eventually developed into the subfield now known as computational chemistry. The field began to grow in the 1970s, and Nobel Prizes in Chemistry were awarded in 1998 to Britain’s John A. Pople for his development of computational methods in quantum chemistry, and in 2013 to Austrian-born Martin Karplus, South African-born Michael Levitt, and Israeli-born Arieh Warshel for the development of multiscale models for complex chemical systems.
Indeed, although computational chemistry has gained increasing recognition in recent decades, it has remained secondary to laboratory experiments, which are still the cornerstone of discovery.
Nevertheless, considering the current advances in Artificial Intelligence, data-centred technologies and ever-increasing amounts of data, we may be witnessing a shift whereby computational methods are used not only to assist laboratory experiments, but also to guide and orient them.
How, then, does Artificial Intelligence achieve this transformation? A notable development is the application of machine learning to materials discovery and molecular design, two fundamental problems in chemistry.
In traditional approaches, molecular design is roughly divided into several stages, each of which can take years and considerable resources, with success by no means guaranteed. The phases of chemical discovery are the following: discovery, synthesis, isolation and testing, validation, approval, commercialisation and marketing.
The discovery phase is based on theoretical frameworks developed over centuries to guide molecular design. However, when looking for “useful” materials (e.g. petroleum jelly [Vaseline], polytetrafluoroethylene [Teflon], penicillin, etc.), we should remember that many of them derive from compounds commonly found in nature, and that their usefulness is often discovered only at a later stage. Targeted research, by contrast, is far more time-consuming and resource-intensive (and even then it may be necessary to start from known “useful” compounds). To give an idea of the scale, the pharmacologically active chemical space (i.e. the number of plausible molecules) has been estimated at 10^60! Even before the testing and screening phases, a manual search of such a space is prohibitive. How, then, can Artificial Intelligence step in and speed up chemical discovery?
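To make that number concrete, a back-of-the-envelope calculation shows why exhaustive search is hopeless. The 10^60 figure comes from the estimate above; the screening rate is an invented, wildly optimistic assumption used purely for illustration:

```python
# Rough estimate of the time needed to exhaustively screen chemical space.
# The screening rate is a hypothetical, optimistic assumption.
chemical_space = 10**60           # estimated pharmacologically active molecules
rate_per_second = 10**9           # assumed: one billion molecules per second
seconds_per_year = 60 * 60 * 24 * 365

years = chemical_space / (rate_per_second * seconds_per_year)
print(f"{years:.1e} years")       # prints roughly 3.2e+43 years
```

Even at a billion molecules per second, the search would take some 10^43 years, which is why smarter, guided exploration is needed.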
First of all, machine learning improves existing methods for simulating chemical environments. We have already mentioned that computational chemistry makes it possible to partially avoid laboratory experiments. Nevertheless, calculations that simulate quantum-mechanical processes face a difficult trade-off between computational cost and the accuracy of the resulting simulations.
A central problem in computational chemistry is solving the equation formulated in 1926 by the physicist Erwin Schrödinger (1887-1961). Schrödinger described the behaviour of an electron bound to a nucleus as that of a standing wave, and proposed an equation, called the wave equation, to represent the wave associated with the electron. For complex molecules the task becomes: given the positions of a set of nuclei and the total number of electrons, calculate the properties of interest. Exact solutions are only possible for single-electron systems; for everything else we must rely on “good enough” approximations. Furthermore, many common methods for approximating the Schrödinger equation scale exponentially with system size, making brute-force solutions intractable. Over time, many methods have been developed to speed up calculations without sacrificing too much precision, yet even the “cheaper” ones can become computational bottlenecks.
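In modern notation, the time-independent equation at the heart of these calculations can be written as:

```latex
\hat{H}\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N) = E\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N)
```

where \(\hat{H}\) is the Hamiltonian operator, which encodes the nuclear positions and the interactions among the N electrons, \(\Psi\) is the electronic wave function, and \(E\) is the energy of the system. For anything beyond a single electron, \(\Psi\) can only be approximated, which is precisely where the computational cost arises.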
One way in which Artificial Intelligence can accelerate these calculations is by combining them with machine learning. Another approach bypasses the modelling of physical processes entirely, directly mapping molecular representations onto the desired properties. Both methods enable chemists to screen databases for various properties, such as nuclear charge or ionisation energy, far more efficiently.
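As a minimal sketch of the second, “direct mapping” approach, the toy model below predicts a property from a fingerprint-like bit vector by similarity to known molecules. The fingerprints, the property values and the choice of k-nearest-neighbour regression are all illustrative assumptions, not a method described in the text:

```python
# Toy "direct mapping": predict a molecular property from a fingerprint-like
# bit vector, with no physical simulation involved. All data are invented.

def tanimoto(a, b):
    """Tanimoto similarity, a standard similarity measure for bit fingerprints."""
    on_a = {i for i, x in enumerate(a) if x}
    on_b = {i for i, x in enumerate(b) if x}
    if not on_a and not on_b:
        return 0.0
    return len(on_a & on_b) / len(on_a | on_b)

def predict(query, training_set, k=2):
    """k-nearest-neighbour regression over fingerprint similarity."""
    ranked = sorted(training_set, key=lambda pair: tanimoto(query, pair[0]),
                    reverse=True)
    return sum(y for _, y in ranked[:k]) / k

# Invented training data: (fingerprint, property value, e.g. an energy in eV).
train = [
    ([1, 1, 0, 0, 1, 0], 10.5),
    ([1, 1, 0, 1, 1, 0], 10.1),
    ([0, 0, 1, 1, 0, 1], 7.9),
    ([0, 1, 1, 1, 0, 1], 8.3),
]

print(predict([1, 1, 0, 0, 1, 1], train))  # nearest to the first two → 10.3
```

Real property-prediction models replace the nearest-neighbour rule with neural networks and the toy bit vectors with learned or engineered molecular representations, but the input-to-property mapping is the same idea.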
While faster calculations are an improvement, they do not solve the issue that we are still confined to known compounds, which account for only a small part of the active chemical space. We still have to manually specify the molecules we want to analyse. How can we reverse this paradigm and design an algorithm to search the chemical space and find suitable candidate substances? The answer may lie in applying generative models to molecular discovery problems.
But before addressing this topic, it is worth discussing how chemical structures can be represented numerically (and which representations lend themselves to generative modelling). Many representations have been developed in recent decades, most of which fall into one of the following four categories: strings, text files, matrices and graphs.
Chemical structures can naturally be represented as matrices. Matrix representations of molecules were initially used to facilitate searches in chemical databases. In the early 2000s, however, a new representation called the Extended-Connectivity Fingerprint (ECFP) was introduced. In computer science, the fingerprint of a file is a fixed-length alphanumeric sequence or bit string that identifies the file by its intrinsic characteristics. ECFP was specifically designed to capture features related to molecular activity, and it is often among the first representations tried when attempting to predict molecular properties.
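The circular-fingerprint idea behind ECFP can be sketched in a few lines: each atom starts with an identifier, identifiers are iteratively updated from their neighbours' identifiers, and every intermediate identifier is folded into a fixed-length bit vector. The sketch below is a heavily simplified assumption-laden toy (real ECFP uses richer atom invariants and a specific hashing scheme); the molecule, radius and bit-vector size are all illustrative:

```python
# Simplified circular fingerprint in the spirit of ECFP (toy sketch only).

def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """atoms: list of atomic numbers; bonds: list of (i, j) index pairs."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    ids = list(atoms)                    # initial identifiers: atomic numbers
    bits = [0] * n_bits
    for _ in range(radius + 1):
        for env_id in ids:
            bits[env_id % n_bits] = 1    # fold each environment into the vector
        # grow each environment by one bond: combine an atom's identifier with
        # its sorted neighbour identifiers and hash the result into a new one
        ids = [hash((ids[i], tuple(sorted(ids[j] for j in neighbours[i]))))
               for i in range(len(atoms))]
    return bits

# Ethanol heavy atoms C-C-O: carbon = 6, oxygen = 8; hydrogens left implicit.
fp = circular_fingerprint([6, 6, 8], [(0, 1), (1, 2)])
print(len(fp), sum(fp))                  # vector length and number of bits set
```

Because the identifiers encode progressively larger atom neighbourhoods, two molecules sharing a substructure will share bits, which is what makes such fingerprints useful for similarity search and activity prediction.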
Chemical structure information can also be stored in a text file, a common output of quantum-chemistry calculations. Such files can be very rich in information, but they are generally not very useful as input for machine learning models. String representations, on the other hand, encode a great deal of information in their syntax; in the SMILES notation, for example, ethanol is written simply as CCO. This makes them particularly suitable for generative modelling, much like text generation. Finally, the graph-based representation is the most natural: it lets us encode atomic properties in the node embeddings and chemical bonds in the edge embeddings. Combined with message passing, it also lets us model how each node is influenced by its neighbours, mirroring the way atoms in a chemical structure interact with each other. These properties make graphs the preferred input representation for deep learning models.
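A single round of the message-passing idea can be sketched in plain Python on a toy molecular graph. The one-hot node features and the simple averaging update are illustrative assumptions; real models use learned embeddings and learned update functions:

```python
# One message-passing step on a molecular graph: each atom's new feature
# vector combines its own features with the sum of its neighbours' features.

def message_passing_step(node_feats, edges):
    n, dim = len(node_feats), len(node_feats[0])
    messages = [[0.0] * dim for _ in range(n)]
    for i, j in edges:                   # undirected bonds: messages both ways
        for d in range(dim):
            messages[i][d] += node_feats[j][d]
            messages[j][d] += node_feats[i][d]
    # simple update: average own features with the aggregated messages
    return [[(node_feats[i][d] + messages[i][d]) / 2 for d in range(dim)]
            for i in range(n)]

# Ethanol heavy atoms C-C-O, features = one-hot [is_carbon, is_oxygen]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]
print(message_passing_step(feats, edges))
# → [[1.0, 0.0], [1.0, 0.5], [0.5, 0.5]]
```

After one step, the middle carbon already "sees" the oxygen it is bonded to (its feature vector picks up an oxygen component), which is exactly how such networks propagate local chemical context through a molecule.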