A bit of theory: data processing flow in brief
To be presented in a digestible way on a doctor's screen, patient data is pulled from primary sources such as Electronic Health Records (EHRs), insurance claims, clinical trials, medical equipment databases, and more. First, we identify the raw data from those primary sources to make sure it meets our criteria and its structure is suitable for further processing. After identification, we move to the development stage, where we filter and clean the data in the received files. Developers denoise the data to minimize the amount that could distort the final picture, and write the logic for further processing. Once the data is processed, we conduct quality assurance and testing, followed by a final verification. The last step is loading the clean, ready-to-go data to the destination server in compliance with the predefined rules.
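The flow above can be sketched as a minimal extract–clean–load pipeline. This is an illustrative sketch only: the record fields and validation rules are invented for the example, not taken from any real system.

```python
# Minimal sketch of the extract -> clean -> load flow described above.
# All field names and rules are hypothetical examples.

def extract(raw_records):
    """Identification: keep only records whose structure fits further processing."""
    return [r for r in raw_records if "patient_id" in r and "value" in r]

def clean(records):
    """Development stage: filter noise and normalize values."""
    cleaned = []
    for r in records:
        if r["value"] is None:  # denoise: drop empty measurements
            continue
        cleaned.append({"patient_id": r["patient_id"], "value": float(r["value"])})
    return cleaned

def load(records, destination):
    """Final step: load clean data to the destination, per predefined rules."""
    destination.extend(records)
    return len(records)

raw = [
    {"patient_id": 1, "value": "4.2"},
    {"patient_id": 2, "value": None},   # noise, dropped at the cleaning stage
    {"note": "malformed row"},          # fails identification
]
db = []
loaded = load(clean(extract(raw)), db)
```

In a real pipeline each stage would of course be far richer, but the shape — identify, clean, verify, then load — stays the same.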
Pitfalls – can we do without them?
Data processing may appear straightforward at first glance, but I assure you, it's far from simple. Any mistake, even a minor one, may lead to incorrect data loading, which is a serious incident in healthcare. Now, let's explore the most prevalent errors in data analysis and processing and understand their impact on patient data management and the efficiency of medical staff.
One of the weightiest and hardest-to-fix errors is one made at the logic-building stage. Say a patient takes a test for coronavirus infection, where only two results exist: positive and negative. To run the test, the equipment processes a combination of metrics provided by sensors. On the basis of that sensor data, a total value is calculated via a formula implemented by a developer, and the result is issued from it. Instead of pulling the raw data directly out of the equipment database, the software feeds the metrics into the formula, performs the calculation, and displays the total value, where 0 means negative and 1 means positive.
Of course, the developer doesn't invent the formula and metrics out of thin air. As a rule, the formula is provided by a healthcare professional, while the metrics are stored in the equipment database the programmer has access to. And here lies the danger of writing the formula incorrectly. Below, let's consider several of the most common vulnerable points.
- Point 1: Wrong formula interpretation
The first, most common, and gravest is human error. Let's consider a simple example. Say a developer receives a formula whose parameters come from the equipment database. Some parameters can have similar names differing by a single letter or digit, and the programmer may simply mix them up and insert the wrong one, which makes the total value completely invalid.
It may also happen that all the parameters were chosen correctly but inserted in the wrong places. And voilà: parameters that should have been multiplied end up being divided.
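A couple of cheap defenses against exactly this class of mistake are keyword arguments and range assertions. The metric names and the expected range below are hypothetical, used only to show the idea.

```python
# Illustrative only: two metrics with near-identical names are easy to confuse.
# Keyword arguments make intent reviewable, and a range check on the result
# catches the "right parameters, wrong places" error early.

def ratio(numerator, denominator):
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator

metrics = {"hgb_a1": 40.0, "hgb_a2": 8.0}  # hypothetical names, one character apart

# Keyword arguments make it explicit which value goes where:
correct = ratio(numerator=metrics["hgb_a1"], denominator=metrics["hgb_a2"])
swapped = ratio(numerator=metrics["hgb_a2"], denominator=metrics["hgb_a1"])

# A range assertion documents the expected magnitude and fails loudly
# if the parameters were ever swapped:
assert 1.0 <= correct <= 100.0
```

Neither trick replaces review by a domain expert, but both turn a silent miscalculation into a loud failure.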
- Point 2: Duplication issues
For example, a patient has a blood test. A health worker uploads the result to the database and accidentally does so twice, creating a duplicate record. This introduces the risk of counting the same parameter twice, which distorts the displayed total value and the overall statistics.
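One common mitigation is to deduplicate on a natural key before any calculation. This is a minimal sketch; the key fields (patient, test, timestamp) are an assumption about what identifies a unique result.

```python
# Sketch of guarding against accidental double uploads; field names are
# hypothetical, and the dedup key is an assumed "natural key" for a result.

def dedupe(results):
    """Drop exact repeats of the same (patient, test, timestamp) record."""
    seen = set()
    unique = []
    for r in results:
        key = (r["patient_id"], r["test"], r["taken_at"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(r)
    return unique

uploads = [
    {"patient_id": 7, "test": "CBC", "taken_at": "2022-09-08T10:00", "value": 5.1},
    {"patient_id": 7, "test": "CBC", "taken_at": "2022-09-08T10:00", "value": 5.1},  # duplicate
]
deduped = dedupe(uploads)
```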
Logic coding is the most critical task when working with data, and these kinds of errors are quite difficult to track, especially when the logic depends on varied inbound data and multiple conditions. For example, if one condition worked correctly while a second failed, we would hardly have any idea how the formula would perform as a whole. Therefore, our primary task is to minimize the possibility of errors in data analysis through validation and rigorous testing.
Data type discrepancies
Logic functioning is entirely the developer's responsibility. But there is something totally independent of a programmer's skills that still has a direct impact on the software logic and calculations: third-party data types. Let's consider several examples with floating-point numbers and extreme values below:
- Point 1: Floating-point numbers
Say there is a metric with a non-terminating value such as 4.333333… with the digits repeating infinitely. But a computer cannot store infinitely many digits, which means the programmer has to round. After the values are rounded, they interact with each other according to the given formula, which inevitably introduces bias. If the rounding leans too heavily toward increasing or decreasing, the total value will be calculated with a high degree of distortion.
It may also happen that the equipment database or another third-party source itself contains incorrect parameters. For example, the equipment returned a rounded value when it should have returned the unrounded one. The system logic picks up the parameter with the integer value and produces a result far outside the expected range.
- Point 2: Extreme values
It happens that data processing returns an extreme value: for example, 0, or an amount a hundred times higher than the expected result. This occurs for a variety of reasons: inadequate biomaterial quality, a wrong amount of reagent, equipment malfunction, and so on. What to do with it? Such cases are entirely in the hands of the responsible developer, who decides whether to ignore the value, incorporate it into the given formula, or bring the subject up for discussion with product owners and healthcare experts. The decision depends on the significance of the case and the technical specification; there's no one-size-fits-all approach, and the issue is resolved case by case.
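Whatever the decision, the safest first step is usually to flag extreme values rather than silently feed them into the formula. In this sketch, the plausible range is an invented example — in practice it would come from the healthcare experts mentioned above.

```python
# Hypothetical sketch: triage extreme values for review instead of silently
# including them in the calculation. The plausible range is invented.

PLAUSIBLE_RANGE = (0.3, 10.0)  # illustrative only, not a real clinical range

def triage(value):
    """Decide what to do with a measurement: use it, or escalate for review."""
    low, high = PLAUSIBLE_RANGE
    if value == 0 or value < low or value > high:
        return "escalate"  # raise with product owners / healthcare experts
    return "use"

decisions = [triage(v) for v in [2.1, 0.0, 450.0]]
```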
The quality of data processing may be influenced not only by the data itself but also by the overall system architecture. For instance, architecture-related issues may not allow for automatic data acquisition, forcing a switch to manual mode. In that scenario, the developer has to spend additional time filling in tables and fixing data so that the system can pick up the correct file.
If a file does not match the template it must be processed against, the data will be transformed incorrectly and inserted into the wrong columns. For example, suppose a patient had a blood test for two hormones, TSH and T4. If the data is processed incorrectly due to architecture issues, the result for TSH will be displayed in the field where T4 should have been. To mitigate this, the system architecture should automate the retrieval of data from the sources, including the necessary reconciliations and filters.
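A simple automated safeguard is to validate the file's header against the expected template before any values are inserted, so a swapped-column file is rejected rather than loaded. The column list below is illustrative.

```python
# Sketch of validating a file against its expected template before insertion,
# so a TSH value never lands in the T4 column. Column names are illustrative.

EXPECTED_COLUMNS = ["patient_id", "TSH", "T4"]

def validate_header(header):
    """Reject the file unless its columns match the template exactly, in order."""
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"file does not match template: {header}")
    return True

accepted = validate_header(["patient_id", "TSH", "T4"])  # matches, passes
try:
    validate_header(["patient_id", "T4", "TSH"])  # swapped columns, rejected
    swapped_ok = True
except ValueError:
    swapped_ok = False
```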
You may wonder how date-related errors occur. In reality, it's one of the most common mistakes, and a seemingly negligible one. Imagine you need to pull dates out of an Excel file. Excel, in its turn, offers a variety of date formats, which makes it difficult to understand what is what. Say the date is 08.09.22, where it's not entirely clear which part is the day and which is the month. To clear it up, we analyze the data source (be it Excel or a database) and identify which date format we are dealing with. The same issue applies to time values saved in a data source, where you also need to determine the time zone. If the data is processed incorrectly, results will be bound to the wrong date and time, which is quite a serious incident in the context of healthcare data management.
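Once the source's format is established, the fix is to parse with an explicit format string and an explicit time zone instead of guessing. In the sketch below, the day-first format and the UTC zone are assumptions standing in for whatever the source actually documents.

```python
from datetime import datetime, timezone

# Sketch: parse dates with an explicit, agreed-upon format instead of guessing.
# "08.09.22" is ambiguous; here we *assume* the source documented day-first.

SOURCE_FORMAT = "%d.%m.%y"  # assumption: day.month.two-digit-year

parsed = datetime.strptime("08.09.22", SOURCE_FORMAT)

# Attach the source's documented time zone explicitly rather than relying on
# the server's local zone (UTC is an assumed example here):
parsed_utc = parsed.replace(tzinfo=timezone.utc)
```

With a month-first assumption the same string would silently become September 8 vs. August 9 — which is exactly why the format must be pinned down per source, not inferred per file.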
Errors in a sphere as sensitive as healthcare can incur significant costs, extending beyond mere financial implications. The volume of raw data, its quality, diverse formats, and the overall system architecture all affect the data processing flow and the final results upon which medical professionals base their decisions.
Hence, mindful of the ramifications that even a minor oversight can have, ETL developers strive to minimize the chances of errors. They achieve this by leveraging automation, meticulous validation, and rigorous data testing to reduce risks in the processing pipeline and help medical staff perform their exceptionally responsible job well.