What is ETL testing?
Given the diverse data sources and complex transformations often involved, ETL pipelines can become quite intricate. ETL testing is a crucial component as it ensures that the data extraction, transformation, and load processes work together seamlessly, accurately, and efficiently.
The objectives of ETL testing are multifaceted. The process guarantees data completeness, ensures proper transformation according to business rules, and validates data quality by identifying anomalies and inaccuracies. It's an integral part of data management, setting the stage for robust data analysis and informed decision-making.
Exploring the toolbox: ETL testing tools and approaches
Just like building an ETL pipeline demands a unique set of competencies, distinct from those needed for traditional software development, ETL testing also calls for specific skills and strategies. And today, I wanted to share the approaches and methods that have proven to be my trusted allies.
Also known as source-to-target testing, the source-destination check is a fundamental type of ETL testing. It's a practical way to ensure the accurate and complete transfer of data from source to target, with no data loss.
Given that we usually deal with vast amounts of data, exhaustive testing of the entire source and destination can take considerable time. In my experience, it is useful to start with a high-level check of the most important metrics – the record count and key financial indicators. For instance, if you have 5,000 records amounting to $10,000 in the source, you need to make sure that you have the same numbers in the destination. Any disparity at this stage is a signal for a more thorough investigation to locate the missing data. And to do that, you may need to identify a specific criterion, like a client type, to break the data down into segments. This decomposition can help you find the root of the issue more effectively.
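The high-level check and the follow-up segmentation can be sketched as follows. This is a minimal toy example using an in-memory SQLite database; the table names, the `client_type` column, and the sample values are illustrative assumptions, not a real schema.

```python
import sqlite3

# Toy source and target tables standing in for a real pipeline's endpoints.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_txn (id INTEGER, client_type TEXT, amount REAL);
    CREATE TABLE target_txn (id INTEGER, client_type TEXT, amount REAL);
    INSERT INTO source_txn VALUES (1, 'retail', 40.0), (2, 'corporate', 60.0);
    INSERT INTO target_txn VALUES (1, 'retail', 40.0);  -- one record lost in transit
""")

def totals(table, where=""):
    """High-level metrics: record count and the summed financial indicator."""
    return con.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table} {where}"
    ).fetchone()

src, tgt = totals("source_txn"), totals("target_txn")
print("source:", src, "target:", tgt)

if src != tgt:
    # Break the data down by a criterion (here: client type) to localize the gap.
    for (ctype,) in con.execute("SELECT DISTINCT client_type FROM source_txn"):
        s = totals("source_txn", f"WHERE client_type = '{ctype}'")
        t = totals("target_txn", f"WHERE client_type = '{ctype}'")
        if s != t:
            print(f"mismatch in segment {ctype!r}: source={s}, target={t}")
```

Running this flags the `corporate` segment, narrowing the search for the lost record from the whole dataset to a single slice.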
A spot check is a rapid-fire way to verify the accuracy and quality of data without checking the entire dataset. Basically, you handpick a subset of data at random from the overall pool for a closer look. The benefit of this type of testing is that it can be laser-targeted and executed quickly.
However, much like looking for a needle in a haystack, if you don’t know where to look, you can miss some critical findings. That’s why this type of testing often goes hand in hand with the source-destination check – first you start with a bird’s-eye view of your data and then zoom in on those subsets where anomalies are detected.
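A spot check might look like the sketch below: sample a handful of record IDs at random and compare them field by field between source and target. The schema and the deliberately corrupted record are illustrative assumptions.

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_txn (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE target_txn (id INTEGER PRIMARY KEY, amount REAL);
""")
con.executemany("INSERT INTO source_txn VALUES (?, ?)",
                [(i, float(i)) for i in range(1, 101)])
# The target copies the source but corrupts one value on the way.
con.executemany("INSERT INTO target_txn VALUES (?, ?)",
                [(i, float(i) if i != 42 else -1.0) for i in range(1, 101)])

def spot_check(sample_size=10):
    """Compare a random handful of records between source and target."""
    ids = random.sample(range(1, 101), sample_size)
    mismatches = []
    for i in ids:
        s = con.execute("SELECT amount FROM source_txn WHERE id = ?", (i,)).fetchone()
        t = con.execute("SELECT amount FROM target_txn WHERE id = ?", (i,)).fetchone()
        if s != t:
            mismatches.append(i)
    return mismatches

print(spot_check())  # a small random sample may or may not catch the bad record
```

The last line illustrates the needle-in-a-haystack caveat: a small sample can easily miss record 42, which is why spot checks work best when aimed at segments already flagged by a source-destination check.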
An integrity check refers to a process used to confirm the accuracy and consistency of data by verifying the relations between the tables of a dataset.
In order to gain a better understanding of this type of testing, let’s have a quick look at the most widely adopted organizational structure for a database. This model is called a star schema and it consists of one or more fact tables connected to any number of dimension tables.
Suppose you're examining a central financial fact table. Although the number of records in your target system matches those in your source, certain facts may not be further detailed by the dimension tables because those tables aren't fully loaded. As a result, you'll just have an endless table of financials with no way of knowing what happened behind those numbers. That's why it is important to check primary and foreign keys. Ensuring these keys are correctly assigned and maintained helps you avoid such dead ends.
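One common way to surface such dead ends is an orphan check: a `LEFT JOIN` from the fact table to a dimension table, keeping only fact rows with no match. The star-schema tables below are a minimal toy setup, not the real insurance schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_client (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_financials (id INTEGER PRIMARY KEY, client_id INTEGER, amount REAL);
    INSERT INTO dim_client VALUES (1, 'Acme');
    -- client_id 2 is missing from the dimension table: an orphaned fact
    INSERT INTO fact_financials VALUES (10, 1, 500.0), (11, 2, 750.0);
""")

def orphaned_facts():
    """Find fact rows whose foreign key has no matching dimension row."""
    return con.execute("""
        SELECT f.id, f.client_id
        FROM fact_financials f
        LEFT JOIN dim_client d ON d.client_id = f.client_id
        WHERE d.client_id IS NULL
    """).fetchall()

print(orphaned_facts())  # -> [(11, 2)]
```

An empty result means every fact can be described through the dimension; any row returned is a number with no story behind it.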
An empirical test refers to a testing method that involves observing and verifying the actual behavior and results of the ETL process based on real data. Simply speaking, it is the process of monitoring and auditing ETL logs to see how well your pipeline works.
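A lightweight log audit could be sketched like this. The log format and the `rows=`/`duration=` fields are hypothetical; a real audit would use whatever your ETL tool actually emits.

```python
import re

# Hypothetical ETL run logs; the third run loaded suspiciously few rows.
log_lines = [
    "2024-05-01 02:00:13 INFO load_clients rows=5000 duration=42s",
    "2024-05-02 02:00:11 INFO load_clients rows=5010 duration=41s",
    "2024-05-03 02:07:55 INFO load_clients rows=12 duration=3s",
]

pattern = re.compile(r"rows=(\d+) duration=(\d+)s")

def audit(lines, min_rows=1000):
    """Flag runs whose loaded row count falls below an expected floor."""
    suspicious = []
    for line in lines:
        m = pattern.search(line)
        if m and int(m.group(1)) < min_rows:
            suspicious.append(line)
    return suspicious

for line in audit(log_lines):
    print("suspicious run:", line)
```

The point is not the regex but the habit: baselining normal row counts and durations from real runs, then alerting when a run drifts far from them.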
And finally, let's consider another, performance-focused, approach to ETL testing. Suppose the data arrives correctly, but the pipeline is too slow and riddled with problems, dragging down overall system efficiency.
The better your SQL query is written, the more efficient the ETL pipelines are. By using SQL Server Management Studio (SSMS) Execution Plan feature, you can get a graphic representation of how well your code works. This tool breaks down the script and analyzes the cost of a particular query relative to the overall batch cost. Once you know the most expensive part of the query, you can fine-tune it to achieve the best performance with the least resource consumption.
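SSMS's Execution Plan is a graphical SQL Server tool, but the same idea can be shown in a scriptable form with SQLite's `EXPLAIN QUERY PLAN`, used here purely as an analogue. The toy table and index are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, client_id INTEGER, amount REAL)")

query = "SELECT SUM(amount) FROM txn WHERE client_id = 1"

def plan(sql):
    """Return the plan's detail strings, one per plan step."""
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

print(plan(query))  # without an index, SQLite reports a full table scan

# An index on the filter column turns the scan into a targeted search.
con.execute("CREATE INDEX idx_txn_client ON txn(client_id)")
print(plan(query))
```

Whatever the engine, the workflow is the same: read the plan, find the most expensive step (a scan, a costly join), and rewrite or index until the plan improves.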
Charting the course of ETL testing flow
Now that we have discussed the fundamental approaches and tools to ETL testing, let’s see how they map against a dynamic data flow. ETL testing is a multi-stage process that starts with the validation of source data, followed by checks during extraction, transformation, and loading stages, and ends in a comprehensive verification of final results.
From theory to practice: What makes ETL testing tricky?
ETL testing certainly comes with its fair share of hurdles. Embarking on this journey can often feel like navigating a maze filled with twists and turns. I know these challenges first-hand, as I am currently engaged in a large-scale data structuring project for a major insurance carrier in the US and European markets. Over the past year, our team faced and conquered the following challenges:
- Massive volume of data. Since our client provides insurance services to hundreds of businesses, the central database swells with billions of records. Further hurdles come from a wide array of data formats – even though end customers have unique specifications for their data, those requirements are not always followed to the letter. As a result, data can arrive in a variety of formats. Many iterations were needed to get everyone and everything on the same page.
- The specific nature of the insurance domain. While some end customers manage to deliver data in the requested format, the content of this data requires a different approach. Thus, we had to adapt and build a starflake – a combination of star and snowflake schemas with interconnected dimension tables.
- Cloud migration. To improve flexibility, boost scalability and accelerate performance, the client decided to move ETL workloads to the cloud. But despite this change in scenery, our ETL pipelines still need to support the same processes as on traditional on-prem servers.
- Incremental load support. Imagine you have loaded 2 billion records and then 100,000 new records come in. Instead of reloading everything from scratch, ETL pipelines must be designed to load only new and modified data, ensuring faster processing and consistent performance.
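A common way to implement incremental loading is a watermark: record the highest change timestamp loaded so far, and on each run pull only rows newer than it. The sketch below assumes an `updated_at` change marker and uses toy SQLite tables; real pipelines often rely on change-data-capture instead.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_txn (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER);
    CREATE TABLE target_txn (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER);
    CREATE TABLE etl_watermark (last_loaded INTEGER);
    INSERT INTO etl_watermark VALUES (0);
""")

def incremental_load():
    """Load only rows changed since the last recorded watermark; return row count."""
    (wm,) = con.execute("SELECT last_loaded FROM etl_watermark").fetchone()
    rows = con.execute(
        "SELECT id, amount, updated_at FROM source_txn WHERE updated_at > ?", (wm,)
    ).fetchall()
    # Upsert so modified rows overwrite their earlier versions in the target.
    con.executemany(
        "INSERT INTO target_txn VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount=excluded.amount, updated_at=excluded.updated_at",
        rows,
    )
    if rows:
        con.execute("UPDATE etl_watermark SET last_loaded = ?",
                    (max(r[2] for r in rows),))
    return len(rows)

con.executemany("INSERT INTO source_txn VALUES (?, ?, ?)", [(1, 10.0, 100), (2, 20.0, 100)])
print(incremental_load())  # -> 2 (initial load)

con.execute("INSERT INTO source_txn VALUES (3, 30.0, 200)")
print(incremental_load())  # -> 1 (only the new record)
```

Testing such a pipeline means verifying both halves of the contract: new and modified rows arrive, and already-loaded rows are not processed again.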
The bottom line
As we wrap up our journey through the complexities of ETL testing, it's clear that this process is no easy feat. The data landscape can often be chaotic, and it's the reliable and robust ETL process that helps maintain a smooth flow. From industry specifics to massive data volumes and a variety of formats, the road to effective ETL pipelines can be paved with hurdles. But armed with the right tools and informed by practical experience, ETL engineers can navigate these challenges, enabling high-quality data delivery.