Natural Language Processing (NLP) Tasks

Healthcare & Life Sciences
Artificial Intelligence & Machine Learning
Python, R

Our client is a multinational Fortune 500 pharmaceutical company that aims to provide products and services of the highest quality to its customers while upholding high standards of responsibility toward patients and everyone who uses its products.

Business Challenge

Internal audits generate large volumes of data, much of it free text entered by human users: findings, CAPAs (corrective and preventive actions), and quality investigations. This data is analyzed to discover existing and emerging issues. The customer would like to employ NLP (natural language processing) to help people discover such patterns.


Topic Modeling and Text Search applications have been developed for discovering common topics and for similarity search across audit findings. R is the implementation language, with Shiny used for UI.

For topic modeling, we used LDA (Latent Dirichlet Allocation), which treats each document as a mixture of a relatively small number of topics (e.g. 20 topics). In our case, each document is the free-text content of a finding. LDA tries to identify commonalities across the set of documents and outputs the proportion of each topic within each document. A human then reviews these proportions in the Shiny app to check whether the discovered topics are indeed cohesive and whether insights can be drawn from them.
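To make the idea concrete, here is a minimal Python sketch of LDA fitted with collapsed Gibbs sampling. It is not the project's R implementation: the function name `lda_gibbs`, the hyperparameter values, and the toy findings are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns per-document topic proportions and the top words of each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]        # tokens per (document, topic)
    topic_word = [[0] * V for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                      # tokens per topic
    assignments = []                                  # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed proportion of each topic within each document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # The three most frequent words of each topic, for human review.
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -topic_word[k][v])[:3]]
        for k in range(n_topics)
    ]
    return theta, top_words

# Made-up, already tokenized findings (not real audit data).
docs = [
    ["calibration", "certificate", "expired", "balance"],
    ["balance", "calibration", "overdue"],
    ["training", "records", "missing", "operators"],
    ["operators", "training", "incomplete", "records"],
]
theta, top_words = lda_gibbs(docs, n_topics=2)
for proportions in theta:
    print([round(p, 2) for p in proportions])
print(top_words)
```

Each row of `theta` sums to 1 and corresponds to the per-document topic proportions that a reviewer would inspect. With a corpus this small the discovered topics are not meaningful; the sketch only shows the mechanics.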

For similarity search, both user queries and documents are represented as numeric vectors (called document embeddings). Documents whose vectors are most similar to the query vector are returned to the user as the search result. Shiny is used for UI.
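The retrieval step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: a simple bag-of-words count vector stands in for a real document embedding, cosine similarity is assumed as the similarity measure, and the function names and example findings are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def search(query, documents, top_k=3):
    """Return the top_k (score, document) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Made-up findings (not real audit data).
findings = [
    "calibration certificate for the balance had expired",
    "training records missing for two operators",
    "temperature excursion in cold storage not investigated",
]
results = search("expired calibration certificate", findings, top_k=2)
for score, doc in results:
    print(round(score, 2), doc)
```

Swapping `embed` for a function that returns dense vectors from a trained model would leave the ranking logic unchanged, provided `cosine` is adapted to dense vectors.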

Libraries used: tm, topicmodels, stm, quanteda, stopwords, textclean.

At the same time, more sophisticated approaches to pre-processing and document embeddings are being developed using Python and libraries from its NLP ecosystem. We have obtained document embeddings using BERT and ULMFiT. One use of the embeddings is document clustering, i.e. identification of groups of similar documents. Another use is predicting some of the categorical variables that users typically enter manually, e.g. which compliance topic a finding belongs to. This can be used to give the user hints during data entry.
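As an illustration of the data-entry hints, the sketch below ranks candidate compliance topics for a new finding with a nearest-centroid rule over embedding vectors. The choice of classifier is an assumption (the project's actual model is not described here), and the three-dimensional vectors, label names, and function names are hypothetical stand-ins for real BERT or ULMFiT embeddings and real categories.

```python
import math
from collections import defaultdict

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fit_centroids(embeddings, labels):
    """Average the embeddings of already labeled findings per label."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {label: mean_vector(vectors) for label, vectors in grouped.items()}

def suggest_labels(embedding, centroids, top_k=2):
    """Rank labels by similarity of their centroid to the new embedding."""
    ranked = sorted(
        centroids,
        key=lambda label: cosine(embedding, centroids[label]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical embeddings of findings whose topic was entered manually.
train_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
train_labels = ["documentation", "documentation", "equipment", "equipment"]

centroids = fit_centroids(train_vectors, train_labels)
hints = suggest_labels([0.7, 0.3, 0.0], centroids)
print(hints)
```

The ranked list could be offered as suggestions in the entry form, leaving the final choice to the user.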

Libraries used: NLTK, Flair, spaCy.

Results & Benefits

Our solution helps non-technical users discover patterns in large amounts of textual data, which may make issues easier to identify. The project is ongoing, with automatic summarization and anomaly detection on the way. These additions will reduce the human effort required to work with the findings database and will also help point out unusual findings.

Related Cases

Multitask Machine Learning

A solution supporting multi-task learning, i.e. the ability of the AI to solve several learning tasks at the same time.

Online Configurator of Balcony Structures

Development of an online portal for automatic calculation of project cost based on multiple parameters.

Revamping Online Store and Warehouse Management System

Our team updated, upgraded and restructured a complex system serving online shops and storage facilities.