As previously seen, chemical representations have developed several sub-types over recent years. There is, however, no clear answer as to which representation is the most effective for a particular problem. For example, matrix representations are often the first choice for property prediction but, in recent years, graphs have also emerged as strong alternatives. It is also important to note that several types of representation can be combined, depending on the problem.
How, then, and with which representations can we explore chemical space? We have already said that string representations are well suited to generative modelling. Graph representations were initially difficult to handle with generative models, but their combination with the Variational Autoencoder (VAE) has recently made them a very attractive option.
In machine learning, a variational autoencoder is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It belongs to the families of probabilistic graphical models and variational Bayesian methods (a family of methods for the approximation of intractable integrals).
VAEs have proved particularly useful because they provide a continuous, machine-readable representation. One study used VAEs to show that both string and graph representations can be encoded into a space where molecules are no longer discrete: each molecule maps to a continuous, real-valued vector, from which it can be decoded back. The Euclidean distance between vectors then corresponds to chemical similarity. An additional model, placed between the encoder and the decoder, predicts the target property at any point in the space.
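As a minimal sketch of the idea that Euclidean distance in a continuous latent space can stand in for chemical similarity: the molecule names and three-dimensional "latent" vectors below are invented for illustration; in a real VAE they would be produced by the trained encoder.

```python
import math

# Hypothetical latent vectors: in a real VAE these would come from the
# encoder; here they are made-up 3-dimensional embeddings for illustration.
latent = {
    "ethanol":  [0.10, 0.90, 0.20],
    "methanol": [0.12, 0.85, 0.25],
    "benzene":  [0.80, 0.10, 0.70],
}

def nearest(query_name, embeddings):
    """Return the molecule whose latent vector is closest (Euclidean distance) to the query."""
    q = embeddings[query_name]
    others = {k: v for k, v in embeddings.items() if k != query_name}
    return min(others, key=lambda k: math.dist(q, others[k]))

print(nearest("ethanol", latent))  # methanol is the closest neighbour in this toy space
```

Nearest-neighbour search by Euclidean distance is exactly the operation that becomes meaningful once molecules live in a continuous space rather than a discrete set.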
But while generating molecules per se is a simple task – we can take any generative model and apply it to the representation of our choice – generating structures that are chemically valid and display the properties we desire is a much harder problem.
The initial approaches to this goal involve training models on existing data sets and then applying transfer learning: the model is fine-tuned on a calibration data set so that it generates structures oriented towards specific properties, and can then be further refined using various algorithms. Many examples of this approach use string or graph representations. However, difficulties arise when chemical validity or the desired properties are not successfully obtained. Furthermore, relying on existing data sets limits the search space and introduces potentially undesirable biases.
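A toy sketch of this pretrain-then-fine-tune idea, using a character-bigram model as a stand-in for a real generative model. The "SMILES-like" strings and both data sets are invented; a practical system would use a neural model and far larger corpora. Fine-tuning here simply adds the calibration set's counts, biasing subsequent sampling towards its patterns.

```python
import random
from collections import defaultdict

def bigram_counts(strings, counts=None):
    """Accumulate character-bigram counts; '^' and '$' mark start and end of string."""
    counts = counts or defaultdict(lambda: defaultdict(int))
    for s in strings:
        padded = "^" + s + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts

def sample(counts, rng, max_len=10):
    """Sample one string from the bigram model, character by character."""
    out, ch = [], "^"
    while len(out) < max_len:
        nxt = counts[ch]
        ch = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if ch == "$":
            break
        out.append(ch)
    return "".join(out)

# "Pre-training" set: a broad (invented) corpus of SMILES-like strings.
pretrain = ["CCO", "CCC", "CCN", "CO", "CN"]
# "Calibration" set: fine-tuning biases generation towards oxygen-rich strings.
finetune = ["CCO", "CO", "COO"]

model = bigram_counts(pretrain)
model = bigram_counts(finetune, model)   # fine-tune: add the calibration counts

rng = random.Random(0)
samples = [sample(model, rng) for _ in range(5)]
print(samples)
```

The design point the sketch preserves is that fine-tuning does not replace what was learned in pre-training; it only shifts the distribution towards the calibration data.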
An attempt at improvement is to use a Markov Decision Process (MDP) to guarantee the validity of the chemical structures, and to optimise the MDP itself towards the desired properties through deep Q-learning (a model-free reinforcement learning algorithm that estimates the value of an action in a particular state). In mathematics, an MDP is a discrete-time stochastic control process: it provides a mathematical framework for modelling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. MDPs are useful for studying optimisation problems solved by dynamic programming and reinforcement learning, and are used in many disciplines, including robotics, automatic control, economics and manufacturing. The MDP is named after the Russian mathematician Andrej Andreevič Markov (1856-1922).
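The deep Q-learning mentioned above can be illustrated in miniature by its tabular ancestor on a toy MDP. The three-state chain, rewards and hyperparameters below are all invented for illustration; in the molecular setting the table is replaced by a neural network, the states are molecules, and the actions are chemical edits.

```python
import random

# Toy deterministic MDP: states 0, 1, 2 on a line; action 1 moves right,
# action 0 moves left (clamped at 0). Reaching state 2 gives reward 1.
N_STATES, GOAL = 3, 2
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

def step(state, action):
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: Q[state][action]

for _ in range(500):                         # training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = rng.randrange(2) if rng.random() < EPS else max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy[:2])   # in states 0 and 1 the learned policy moves right, towards the goal
```

The update rule inside the loop is the whole of Q-learning; "deep" Q-learning keeps it unchanged and merely learns Q with a neural network instead of a table.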
A particular advantage of this model is that it lets users inspect its preferences: (a) visualising the degree of preference for each action (1 being the most preferred, 0 the least preferred); and (b) taking steps that maximise the quantitative estimate of drug-likeness while remaining similar to the starting molecule.
Although still in its infancy, the use of Artificial Intelligence to explore chemical space is already showing great promise: it provides a new paradigm for exploring chemical space and a new way to test theories and hypotheses. Although such in-silico methods are not as accurate as experimental research, computationally-based methods will remain an active research area for the foreseeable future and are already becoming part of every research group.
So far we have seen how Artificial Intelligence can help discover new chemicals more quickly by exploiting generative algorithms to search the chemical space. Although this is one of the most noteworthy use cases, there are also others. Artificial Intelligence is being applied to many other problems in chemistry, including:
1. Automated laboratory work. Machine learning techniques can be used to speed up synthesis workflows. One approach uses self-driving laboratories to automate routine tasks, optimise resource expenditure and save time. A relatively new but noteworthy example is the Ada robotic platform, used to automate the synthesis, processing and characterisation of materials. Ada's tools provide predictions and models that automate repetitive processes, using machine learning and AI technologies to collect, understand and process data, so that resources can be dedicated to higher-value activities.
Ada is essentially a laboratory that discovers and develops new organic thin-film materials without any human supervision. Its productivity is making most recent graduates uncomfortable. The entire thin-film fabrication cycle, from the mixing of chemical precursors, through deposition and thermal annealing, to the final electrical and optical characterisation, takes only twenty minutes. A further aid is a mobile robotic chemist that can operate instruments and take measurements, carrying out 688 experiments over eight days.
2. Chemical reaction prediction. Classification models can be used to predict the type of reaction that will occur or, simplifying the problem, whether a given chemical reaction will occur at all.
3. Chemical data mining. Chemistry, like many other disciplines, has an extensive scientific literature that can be mined for trends and correlations. A notable example is the mining of the vast amounts of information produced by the Human Genome Project to identify trends in genomic data.
4. Although the new data-driven trend is developing rapidly and has had a great impact, it also entails many new challenges, including the gap between computation and experiment. Although computational methods aim to support experimental goals, the results of the former are not always transferable to the latter. For example, when using machine learning to find candidate molecules, we must bear in mind that molecules rarely have a unique synthetic pathway, and it is often difficult to know whether an unexplored chemical reaction will work in practice. Even if it works, there remain problems with the yield, purity and isolation of the compound under study.
5. The gap between computational and experimental work widens further because computational methods rely on metrics that do not always transfer to the laboratory, such as the Quantitative Estimate of Drug-likeness (QED), a computed score of how "drug-like" a molecule is, whose experimental verification may not be feasible. There is also the need for better databases, and the related problem of the lack of benchmarks. Since chemical space is effectively infinite, one hopes for a sample large enough to support generalisation. Nevertheless, most of today's databases were designed for other purposes and often use different file formats; some have no validation procedures for submissions, or were not designed for AI tasks. It should also be said that most of the available databases have a limited scope: they contain only certain types of molecules. Furthermore, most tasks involving Artificial Intelligence for chemical prediction have no reference platform, making comparisons between different studies impracticable.
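In generative drug discovery, QED usually denotes the Quantitative Estimate of Drug-likeness, computed as a geometric mean of desirability functions over molecular descriptors. Below is a heavily simplified sketch of that aggregation scheme: the two descriptors, the "ideal" values and the Gaussian desirability curves are all assumptions made for illustration, whereas the real QED uses eight desirability functions fitted to known drugs.

```python
import math

def desirability(value, ideal, width):
    """Gaussian desirability in (0, 1]: 1 at the ideal value, falling off with distance."""
    return math.exp(-((value - ideal) / width) ** 2)

def toy_qed(mol_weight, logp):
    """Geometric mean of the individual desirabilities (hypothetical ideals)."""
    d = [desirability(mol_weight, 300, 150),   # ideal weight ~300 Da (assumed)
         desirability(logp, 2.5, 2.0)]         # ideal logP ~2.5 (assumed)
    return math.exp(sum(math.log(x) for x in d) / len(d))

score = toy_qed(mol_weight=320, logp=2.0)
print(round(score, 3))  # a value in (0, 1]; higher means more "drug-like"
```

The point relevant to the computation-experiment gap is visible even in this sketch: the score is entirely a function of computed descriptors, so a high value says nothing by itself about whether the molecule can be synthesised or how it behaves in the laboratory.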
One of the main reasons for the success of AlphaFold – which, as already seen, is an AI programme developed by DeepMind (Alphabet/Google) to predict the 3D structure of proteins – is that all of the above was provided by the Critical Assessment of protein Structure Prediction, an organised benchmark for inferring a protein's 3D structure from its amino acid sequence, e.g. predicting its secondary and tertiary structure from its primary structure. This demonstrates the need for similarly organised efforts to streamline, simplify and improve other chemical prediction tasks.
In conclusion, as we continue to advance in the digital age, new algorithms and more powerful hardware will continue to lift the veil on previously intractable problems. The integration of Artificial Intelligence into chemical discovery is still in its infancy, but it is already commonplace to hear the term "data-driven discovery".
Many companies, whether pharmaceutical giants or newly founded start-ups, have adopted many of the above technologies and brought greater automation, efficiency and reproducibility to chemistry. Artificial Intelligence enables us to conduct science on an unprecedented scale, and in recent years this has generated many initiatives and attracted funding that will continue to lead us further into an era of autonomous scientific discovery.