Big Data ETL

Industries
Healthcare & Life Sciences
Expertise
Data Engineering & Business Intelligence
Technologies
SQL Server, .NET
Client

Our client is a major international pharmaceutical company conducting research and development related to a wide range of human medical disorders, including mental illness, neurological disorders, anaesthesia and analgesia, gastrointestinal disorders, fungal infection, allergies, and cancer.

Business Challenge

The amount of data required for healthcare researchers to be able to do their work increases daily. Our client’s databases contain terabytes of data.

This makes it a challenge to perform operations on data—loading new datasets, transforming data, and querying resulting databases—in manageable time.

It is also critical to make sure that the data is of adequate quality, as the results of the medical research depend on it.

Another challenge is to lower the costs of ETL process by automating as much work as possible. 

Solution

The team performed database and application design, development, quality assurance; data extractions, loads, and transformations; administering of databases and query optimization; users support and issue tracking.

In order to be able to operate on huge amounts of data, the latest advanced SQL Server Enterprise edition features were used, such as page-level data compression, horizontal range partitioning, and partition switching.

Some databases are migrated to the massively parallel database system—Analytic Platform System, which improved query performance significantly.

Data quality was enforced by Validation Analysis System—a utility that collects statistics and finds anomalies within loaded data, and a website displaying results.

The majority of ETL operations were automated by Integration Services packages and CDM Builder application.

CDM Builder is a multithreaded & distributed application for transforming a number of popular databases to the Common Data Model. The purpose of the Common Data Model (CDM) is to standardize the format and content of the observational data.

Results & Benefits

Data load time was significantly improved, in some cases from days and weeks down to hours.

The quality of data was improved with the help of Validation Analysis System.

Data integration process was made mostly automated, requiring little to no human attention.

Related Cases

Read all

RTSM Solution: Data Ingestion Improvement

Removing issues in data architecture and processing in order to provide a solid foundation for future growth of the platform.

LMS Content Import and Export Feature

A solution for importing and exporting content from / to Moodle and IMSCC platforms.

Content Generation with Copilot Studio and MCP Servers

A solution to help new teachers rapidly adapt to the educational system while providing easy access to the existing content base.