Machine Learning Pipeline for Biochemistry

Healthcare & Life Sciences
Artificial Intelligence & Machine Learning
R, Python

Our client is a major international pharmaceutical company that conducts research and development across a wide range of human medical disorders, including mental illness, neurological disorders, and cancer.

Business Challenge

For every pharmaceutical company, it is vitally important to maintain its drug discovery pipeline, that is, the set of drug candidates under development. After all stages of R&D and clinical research, some of these drug candidates will become new drugs. The drugs discovered by the company are protected by patents, but only for a limited period of time. This means that the company must keep discovering new drugs in order to stay profitable. However, the search for drug candidates is extremely expensive, as it involves thousands of experiments to find substances that have the desired effect on biotargets.

Our client asked us to deliver a very fast and highly scalable data pipeline that uses different machine learning algorithms to learn and predict chemical compound activity, thereby reducing the number of real experiments.

The solution should be able to handle different types and formats of both input and output data. The pipeline should be flexible and allow users to adapt it to their needs. It should be possible to distribute complex calculations across CPU cores and across a cluster. It should also be possible to use multiple GPU devices for computationally expensive calculations.


Python was chosen as the main development language since scientific teams mainly use Python scripts and Jupyter notebooks for their research. We created a library of tools allowing users to construct data pipelines according to their actual business needs. The library has an intuitive user interface resembling well-known Python libraries such as scikit-learn.

There are 7 major blocks in the library:

  • tools for preprocessing a chemical compound
  • tools for fingerprinting
  • tools for clustering
  • tools for folding
  • tools for training supervised machine learning and deep learning models
  • tools for uncertainty estimation
  • model-serving API.

Data preparation

Tools for preprocessing allow users to

  • convert the representation of a compound from different formats into a single chemical notation (e.g. canonical SMILES),
  • standardize molecule representations,
  • remove heavy atoms, and so on.

Tools for fingerprinting use various methods to transform a molecule into a numerical tensor.

The client required these tools to support a wide range of algorithms, both from popular packages (Chemaxon and RDKit) and from the latest research. To meet this requirement, our team:

  • adapted and implemented algorithms from the latest scientific publications,
  • created Python wrappers for Chemaxon-based tools, since Chemaxon is written in Java,
  • created wrappers for some RDKit-based tools to work around several minor bugs in that package.

Machine learning tools and algorithms

Before machine learning starts, the dataset should be split into training, validation and test sets (data folding). Before that, the data need to undergo clustering analysis, which finds subsets of similar objects. This is necessary to make each subset ‘representative’, i.e. the different groups of objects (defined through clustering) should be represented in each subset in approximately the same ratio as in the whole dataset. The developed solution performs data clustering and folding with unsupervised learning methods. After that, the generated datasets are used for supervised learning.

Tools for clustering use classical algorithms as well as advanced ones. We had to modify the Faiss library written in C++: we created a Python wrapper adding the Tanimoto similarity metric, as the latter is the most appropriate choice for fingerprint-based similarity calculations.
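
To illustrate the metric mentioned above, here is a minimal sketch of Tanimoto similarity for binary fingerprints, each represented as the set of indices of its 'on' bits. The function name `tanimoto` and the toy fingerprints are assumptions made for this example; it is not the Faiss wrapper described in this case.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity of two binary fingerprints.

    Each fingerprint is given as the set of indices of its 'on' bits.
    """
    if not fp_a and not fp_b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints match
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Toy fingerprints (illustrative only, not derived from real molecules)
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
similarity = tanimoto(fp1, fp2)  # 2 shared bits out of 5 distinct bits
```

The value is the number of shared 'on' bits divided by the number of bits set in either fingerprint, so it ranges from 0 (no overlap) to 1 (identical).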

Tools for data folding handle multi-task, multi-class folding efficiently using all available CPU cores (see the case ‘Multitask Machine Learning’ for details on multi-task learning). They produce data folds that each contain the same fraction of data from every cluster.
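
The idea of cluster-balanced folds can be sketched as follows. This is a simplified, single-task illustration under our own assumptions (the function name `stratified_folds` and the round-robin strategy are hypothetical); the multi-task, multi-class and parallel aspects of the actual tools are not shown.

```python
import random
from collections import defaultdict


def stratified_folds(cluster_labels, n_folds, seed=0):
    """Assign each sample to a fold so that every cluster is spread
    as evenly as possible across the folds."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)

    fold_of = [None] * len(cluster_labels)
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        # Deal the members of this cluster round-robin over the folds
        for position, index in enumerate(members):
            fold_of[index] = position % n_folds
    return fold_of


# Toy cluster assignment: six samples in cluster "a", three in cluster "b"
demo_clusters = ["a"] * 6 + ["b"] * 3
demo_folds = stratified_folds(demo_clusters, n_folds=3)
```

In this toy case each of the three folds receives two samples from cluster "a" and one from cluster "b". When a cluster size is not a multiple of the number of folds, this simple version gives the leftover samples to the first folds.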

For supervised learning we used various algorithms, including

  • classical ML such as SVM, KNN, SGD, etc.,
  • tree-based algorithms such as XGBoost, NGBoost, etc.,
  • PyTorch-based deep learning algorithms, and
  • message-passing neural networks (MPNN). The latter are powerful tools for predicting properties of molecular graphs. We implemented some of them from scratch based on scientific papers.

Results evaluation

For uncertainty estimation we included Mondrian inductive conformal predictors and Venn-ABERS predictors. Also, ensemble-based models can compute aleatoric and epistemic uncertainties.
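
As a rough illustration of the first technique, a Mondrian (class-conditional) inductive conformal predictor for classification can be sketched as below. The function names, the toy calibration scores and the choice of nonconformity score are assumptions made for this example and do not reflect the library's actual interface.

```python
def mondrian_p_values(calibration, test_scores):
    """Class-conditional (Mondrian) conformal p-values.

    calibration: list of (true_label, nonconformity_score) pairs from a
                 held-out calibration set.
    test_scores: dict mapping each candidate label to the nonconformity
                 score of the test object under that label.
    """
    p_values = {}
    for label, score in test_scores.items():
        # Only calibration examples of the same class are compared
        same_class = [s for y, s in calibration if y == label]
        at_least_as_strange = sum(1 for s in same_class if s >= score)
        p_values[label] = (at_least_as_strange + 1) / (len(same_class) + 1)
    return p_values


def prediction_set(p_values, significance):
    """Labels that cannot be rejected at the given significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical calibration scores, e.g. 1 - predicted probability of the true class
calibration = [
    ("active", 0.1), ("active", 0.3), ("active", 0.6),
    ("inactive", 0.2), ("inactive", 0.4), ("inactive", 0.5), ("inactive", 0.9),
]
p_vals = mondrian_p_values(calibration, {"active": 0.2, "inactive": 0.8})
predicted = prediction_set(p_vals, significance=0.5)
```

Because each p-value is computed only against calibration examples of the same class, the error rate is controlled per class, which matters for imbalanced data such as active versus inactive compounds.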

ML models and uncertainty estimators are combined in a single entity that allows users to

  • train a single model or
  • do k-fold cross validation or
  • do nested cross validation.
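
The difference between plain and nested cross validation can be sketched with index splits alone. This is a minimal illustration with hypothetical helper names; it uses simple interleaved folds rather than the cluster-based folds described earlier.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for plain k-fold cross validation."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def nested_cv_splits(n_samples, outer_k, inner_k):
    """Yield (inner_splits, outer_test) for each outer fold.

    The inner splits partition only the outer training data, so the outer
    test fold is never seen during model selection.
    """
    for outer_train, outer_test in k_fold_indices(n_samples, outer_k):
        inner_splits = [
            ([outer_train[i] for i in tr], [outer_train[i] for i in va])
            for tr, va in k_fold_indices(len(outer_train), inner_k)
        ]
        yield inner_splits, outer_test


nested = list(nested_cv_splits(n_samples=12, outer_k=3, inner_k=2))
```

The inner splits would typically be used to choose hyperparameters, and the outer test fold to estimate the performance of the chosen configuration.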

The same entity also allows users to perform hyperparameter optimization by two different methods: grid search and the tree-structured Parzen estimator (TPE).
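
Of the two methods, grid search is the simpler one and can be sketched in a few lines. The function `grid_search` and the stand-in objective below are hypothetical; in practice the objective would be something like a mean cross-validation score, and TPE (not shown) would propose candidates adaptively instead of enumerating them.

```python
from itertools import product


def grid_search(objective, grid):
    """Evaluate every combination in `grid` and return the best one.

    objective: callable taking hyperparameters as keyword arguments and
               returning a score to maximize.
    grid:      dict mapping each hyperparameter name to candidate values.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in objective for illustration: peaks at depth=4, learning_rate=0.1
def toy_objective(depth, learning_rate):
    return -((depth - 4) ** 2) - (learning_rate - 0.1) ** 2


best_params, best_score = grid_search(
    toy_objective,
    {"depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 1.0]},
)
```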

The training procedure is computationally efficient because the computation is distributed over all CPU cores and multiple GPU devices.

Additionally, users can fine-tune the models or use the models and uncertainty estimators with third-party active learning frameworks, since they expose a scikit-learn-like interface.


All the data pipelines can be exposed via an API. Users (chemists performing machine learning model training) can create a simple YAML file in which they describe which methods they would like to use to preprocess the compound and compute the fingerprint, and which model they would like to use for inference. Then the user starts the Flask server.

After that, clients (other chemists) can use the trained model on the server to predict properties of different chemical compounds; they can query either a single compound or a batch of compounds. The performance is high enough to use this API in real-time applications such as molecular drawing tools: the chemist draws a molecule and the tool predicts its characteristics ‘on the fly’ through the API.
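
On the client side, single and batch queries can share one request format. The sketch below only builds the JSON body; the field name `compounds` and the function `build_predict_payload` are hypothetical, since the actual endpoint and schema are not described in this case.

```python
import json


def build_predict_payload(smiles):
    """Build a JSON request body for one compound or a batch of compounds.

    The field name "compounds" is a hypothetical choice for this sketch.
    """
    if isinstance(smiles, str):
        smiles = [smiles]  # treat a single compound as a batch of one
    return json.dumps({"compounds": list(smiles)})


single_body = build_predict_payload("CCO")                # ethanol
batch_body = build_predict_payload(["CCO", "c1ccccc1"])   # ethanol, benzene
```

Treating a single compound as a batch of one keeps the server-side handling uniform for both kinds of query.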

Code quality is ensured by a large number of unit tests covering almost all use cases.

The product is well documented and includes many sample Jupyter notebooks for different situations.

Results & Benefits

Our solution made it possible to handle large amounts of data provided in the different types and formats that are common in the scientific community. The pipeline uses different machine learning algorithms and allows complex calculations to be distributed over multiple CPU cores, multiple GPU devices or even a cluster.

Using Python as the main development language in combination with an intuitive user interface makes it easy to get started with the library. It also allows users to integrate building blocks from our library into third-party Python scripts and tools, e.g. active learning frameworks.

High-performance code reduces the training time of ML models and allows users to run inference in real-time applications.
