Natural Language Processing (NLP) Tasks

Artificial Intelligence & Machine Learning

Our client is a multinational Fortune 500 pharmaceutical company that aims to provide customers with products and services of the highest quality while maintaining high standards of responsibility toward patients and everyone else who uses its products.


Internal audits generate large amounts of data, much of it free text entered by human users: findings, CAPAs (corrective and preventive actions), and quality investigations. The data is analyzed to discover existing and emerging issues. The client would like to employ natural language processing (NLP) to help people discover such patterns.


Topic modeling and text search applications have been developed to discover common topics and to run similarity searches across audit findings. Both are implemented in R, with Shiny used for the UI.

For topic modeling, we used LDA (Latent Dirichlet Allocation), which treats each document as a mixture of a relatively small number of topics (e.g. 20 topics). In our case, the documents are the free-text fields of the findings. LDA identifies groups of words that tend to occur together across the document set and outputs the proportion of each topic within each document. A human then reviews these proportions in the Shiny app to judge whether the discovered topics are cohesive and whether insights can be drawn from them.
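To make the idea of per-document topic proportions concrete, the following is a minimal sketch in plain Python. It is not the project's R implementation; it assumes collapsed Gibbs sampling as the fitting method, and the function name lda_gibbs, the hyperparameters alpha and beta, and the four toy findings are all hypothetical and chosen only for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative sketch).

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic[d][k] is the estimated proportion of topic k in document d
    and topic_word[k][w] counts how often word w was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables: topic per document, word per topic, total words per topic.
    n_dk = [[0] * n_topics for _ in docs]
    n_kw = [[0] * n_words for _ in range(n_topics)]
    n_k = [0] * n_topics

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][wi] -= 1
                n_k[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wi] + beta)
                    / (n_k[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wi] += 1
                n_k[k] += 1

    # Smoothed topic proportions per document; each row sums to 1.
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return doc_topic, n_kw, vocab

# Hypothetical, already tokenized findings (not real audit data).
docs = [
    "temperature excursion cold storage temperature log".split(),
    "cold storage temperature alarm not reviewed".split(),
    "training record missing signature".split(),
    "operator training record not signed".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)

for d, proportions in enumerate(doc_topic):
    print(d, [round(p, 2) for p in proportions])
```

Each printed row is one finding's topic mixture, which is the kind of output a reviewer would inspect. A real corpus would also need the pre-processing (stop word removal, text cleaning) that the listed libraries provide.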

For similarity search, both user queries and documents are represented as numeric vectors (called document embeddings). The documents whose vectors are most similar to the query vector are returned to the user as the search results.
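The ranking step described above can be sketched as follows. The source does not say which embedding or similarity measure the search app uses, so this sketch assumes cosine similarity and substitutes a simple bag-of-words count vector for a real document embedding. The names embed, cosine, and search and the sample findings are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a document embedding: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, documents, top_n=3):
    """Return the top_n (score, document) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Hypothetical findings, invented for illustration only.
documents = [
    "Temperature log missing for cold storage room",
    "Training records not signed by supervisor",
    "Cold storage temperature excursion not investigated",
    "Equipment calibration certificate expired",
]

results = search("cold storage temperature", documents, top_n=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Swapping embed for a function that returns dense vectors from a trained model would leave the ranking logic unchanged, provided cosine is adapted to the vector type.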

Libraries used: tm, topicmodels, stm, quanteda, stopwords, textclean.

At the same time, more sophisticated approaches to pre-processing and document embeddings are being developed using Python and libraries from its NLP ecosystem. We have obtained document embeddings using BERT and ULMFiT. One use of the embeddings is document clustering, i.e. identifying groups of similar documents. Another is predicting some of the categorical variables that users typically enter manually, e.g. which compliance topic a finding belongs to. Such predictions can be offered to the user as hints during data entry.
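As an illustration of how embeddings could drive data-entry hints, the sketch below ranks candidate categories for a new finding by comparing its embedding with the average embedding of each category. The source does not state which classifier is used, so the nearest-centroid approach is an assumption. The three-dimensional vectors stand in for real BERT or ULMFiT embeddings, and the labels and function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fit_centroids(embeddings, labels):
    """Average the embeddings of each label into one centroid per label."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }

def suggest(embedding, centroids, top_n=2):
    """Rank labels by similarity of their centroid to the new embedding."""
    ranked = sorted(
        centroids,
        key=lambda label: cosine(embedding, centroids[label]),
        reverse=True,
    )
    return ranked[:top_n]

# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
embeddings = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1],   # labelled "documentation"
    [0.1, 1.0, 0.0], [0.0, 0.9, 0.2],   # labelled "equipment"
    [0.1, 0.0, 1.0], [0.0, 0.2, 0.9],   # labelled "training"
]
labels = ["documentation", "documentation",
          "equipment", "equipment",
          "training", "training"]

centroids = fit_centroids(embeddings, labels)
hints = suggest([0.8, 0.3, 0.1], centroids)
print(hints)
```

Returning a short ranked list rather than a single label fits the hinting use case: the user still makes the final choice, and a wrong top suggestion costs little.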

Libraries used: NLTK, Flair, spaCy.


Our solution helps non-technical users discover patterns in large amounts of textual data, which may make issues easier to identify. The project is ongoing, with automatic summarization and anomaly detection planned next. These additions are expected to reduce the human effort required to work with the findings database and to help point out unusual findings.