How we moved an in-house big data ETL platform to the cloud. Part II

This is why a BSA with domain knowledge is so important.


We’ve built an on-prem big data ETL platform and partly moved it to the cloud for our client. The cloud transition happened after the largest Azure service partner in the world – the third party working for our client – said it could do better in the cloud and initiated the transition itself.

In theory, this would take only a few months and ensure faster data transfer, verification and mapping than an on-prem solution could offer. The clean, stabilized data is then used for reporting and reconciliation for a Fortune 500 carrier. In practice, the process initiated and led by the renowned cloud vendor took more than a year. It resulted in a bulky, though technologically very advanced, solution that will undergo a lot of further improvements in the coming months (if not years).

Hi, I’m Ivan Sokolov, BI and ETL developer at Symfa, and this is the second part of my story about the big data ETL platform that we moved to the cloud.

The first part of the story is here. You’re very much welcome to read it if you’re unfamiliar with the project I’m talking about.

Let’s see now how the cloud part of the Big Data ETL Insurance Platform works and what’s good and bad about it.

Table of Contents

  • What is our data management routine like now that we’re in the cloud?
  • What went wrong with the cloud infrastructure?
  • New project leaders are dissatisfied with the cloud part
  • If it means that many hurdles, why even bother moving to the cloud?
  • How much exactly does this speed cost the client?
  • Challenges: What to expect if you’d like to build the same platform in the cloud
  • How did this huge two-and-a-half-year initiative turn out for the client?
  • At the end of the day…

What is our data management routine like now that we’re in the cloud?

As you can see in the picture below, the source data sits on the left, in the Preprocessor. This is the only on-prem part of the Insurance Platform now. As you move to the right, the data processing steps follow like carriages in a train, ending with the data storage at the tail – all of that is my part of the job, done in the cloud.

It is important to understand that the data we – the cloud team – receive already mapped and verified is backed by the huge work of the on-prem team (Symfa’s part of it alone included 23 people). That team has been partially disbanded, because they did their job very well and the project is coming to an end.

Data flow

The Preprocessor pulls the data from the source folders, maps it, verifies it and stores it on local servers. Then comes the infrastructure prepared by the cloud vendor, which I use to do my part of the job.

But before I could do this, there were a lot of infrastructure hurdles and – apparently – miscommunication issues between the client and the cloud vendor that resulted in delays and wasted resources.

What went wrong with the cloud infrastructure?

Initially, the cloud vendor did a PoC, tested it on one big contractor’s data set and left to build up the muscles of the solution and extend it. Our on-prem solution processes about 350 businesses, with 120 new ones planned to be added to the platform soon. So the cloud vendor had a lot of work to do to turn the PoC into an MVP capable of processing about 500 insurance businesses.

To put it briefly: when they came back with the “ready” solution, it didn’t work.

“No problem, life happens” was the client’s polite reaction, and together – Symfa, the client and the cloud vendor – we spent the autumn and winter of 2023 upgrading, redoing, fixing bugs, improving and bringing the MVP into working condition.

In May we were still making improvements, redoing some parts and testing the uploads.

New project leaders are dissatisfied with the cloud part

It so happens that the cloud vendor has built a very technologically advanced monolith.

The customer has hired new leaders to polish and customize the platform. My new manager has worked in insurance longer than I’ve been alive, or so it seems to me. He says the cloud part… isn’t quite right. What’s happening under the hood makes sense, but it is cumbersome and complicated. That’s how he puts it.

Say we have a pipeline, and sometimes we need to run only a piece of it. Yet for a certain array of data we have to run it end-to-end and spend a lot of capacity on it, while in reality we need only 30 percent of the whole process. This translates into extra costs.

That is, a lot of things need to be redone and half of the solution improved, so that it becomes flexible and easy to use. Symfa is part of that improvement process, too. For me it’s a pleasant challenge, because we are learning a lot from it. The customer probably doesn’t feel the way I do, because the bill is already way too big.

If it means that many hurdles, why even bother moving to the cloud?

It’s the speed of the data processing.

Recently, we tried loading test data to imitate the production phase of the project. Our manager loaded all the business lines in just 12 hours. She ramped the servers up – and will pay a huge bill for that – but she processed Claims and Premiums for six business lines in half a day, something that would previously have taken us a week, probably!

How much exactly does this speed cost the client?

I cannot tell you, unfortunately, as I’m under an NDA. But Azure provides a pretty clear cost analysis, cost alerts – everything you need. Azure is super transparent about telling you that you’re paying a lot.

Azure cost management and billing

The good news is that you’re not paying for nothing. By the end of the project we will have the exact numbers, but it’s already obvious that the cloud can do the job a lot faster than the local servers.

Challenges: What to expect if you’d like to build the same platform in the cloud

  1. When you transfer a solution to the cloud, the most important thing to realize is that what works on local servers definitely cannot be copied to the cloud as-is.
  2. You need to be able to convey business logic, and for that you have to try and find a vendor who understands the domain and – most importantly – your internal processes.

If the vendor doesn’t understand what they are doing, it doesn’t matter how well they are doing it. The client’s engineers saved the day, to be honest, but the losses were huge.

The whole initiative was supposed to take a few months. Here we are in May: the PoC was delivered more than a year ago, and the cloud vendor’s last developer left in March.

How did this huge two-and-a-half-year initiative turn out for the client?

The Preprocessor (99.9% complete)

  • 6 business lines – Warranty, PI, Property, M&C, LE, and A&H
  • 220 million records in the database (and this isn’t the limit)
  • 350+ businesses processed (with 120 more about to be added soon)
  • Reconciliation for the period from 1995 to 2023 – 28 years’ worth of data mapped, verified and brought to order
  • Files in xls, xlsb, csv, txt and xlsm formats, plus Access and SQL databases, converted to a single xlsx format

The Cloud part

I work with the data from the moment it lands in the Landing Zone – the starting point of the cloud section – all the way to the Global Model in the cloud DWH.

Data Lake & Data Factory

The Data Lake serves as the cloud data storage, where data moves from phase to phase: from the bronze layer to the silver one and then the gold one. Each layer is responsible for its own share of the data preprocessing. At some stage an additional source or a piece of business logic is added, and finally the data turns into a model.
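
To give you a rough idea of what one of those layer-to-layer hops looks like, here is a minimal PySpark sketch. The storage paths, column names and cleanup steps are invented for illustration – the real notebooks carry far more logic per layer.

```python
# A minimal sketch of one bronze-to-silver hop in the Data Lake.
# Paths, column names and the cleanup rules are assumptions for illustration,
# not the project's actual code.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lake = "abfss://datalake@storageaccount.dfs.core.windows.net"  # placeholder account
bronze_path = f"{lake}/bronze/claims/"
silver_path = f"{lake}/silver/claims/"

# Bronze: raw data as it arrived from the Preprocessor.
raw = spark.read.parquet(bronze_path)

# Silver: deduplicated, typed, minimally standardized data.
clean = (
    raw.dropDuplicates(["policy_number", "transaction_id"])
       .withColumn("premium_amount", F.col("premium_amount").cast("decimal(18,2)"))
       .withColumn("load_date", F.current_date())
)

clean.write.mode("overwrite").parquet(silver_path)
```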

Data Factory works as an orchestrator. 

To make it clearer: Data Factory is like a pipeline – we go from left to right, copy information, transform it, run pieces of code. We had exactly the same thing on local servers in the Astera Centerprise app. The process is more or less standard and differs only slightly from one ETL tool to another. That is, Data Factory is a tool that lets us configure these pipelines and visually displays how data is cleaned, verified, mapped and so on. On the project we tend to call Data Factory the orchestrator: “You, piece of data #1, go here, and you, piece of data #134, go there.”

Example: What is it like to work with Data Factory

Under the shell of Data Factory there are Synapse notebooks written in PySpark. This is the code that does the work at each of the data preprocessing stages.

Example: What do Synapse notebooks look like? 
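
I can’t show the project’s actual notebooks, so here is a rough sketch of the kind of validation and mapping code that lives inside one of them. The business_line parameter, paths, column names and rules are all made up; in a real Synapse setup, the pipeline’s Notebook activity would pass such parameters into the notebook at run time.

```python
# A rough idea of what one preprocessing notebook does -- not the project's actual code.
# "business_line" stands in for a parameter the pipeline would inject at run time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

business_line = "Warranty"  # assumed parameter, overridden per pipeline run
lake = "abfss://datalake@storageaccount.dfs.core.windows.net"  # placeholder account

silver = spark.read.parquet(f"{lake}/silver/{business_line.lower()}/")

# Verification: reject rows that fail basic sanity checks.
validated = silver.filter(
    F.col("policy_number").isNotNull() & (F.col("premium_amount") >= 0)
)

# Mapping: translate source codes into the target model's vocabulary.
mapped = validated.withColumn(
    "line_of_business",
    F.when(F.col("source_lob") == "WAR", "WARRANTY").otherwise(F.col("source_lob")),
)

mapped.write.mode("overwrite").parquet(f"{lake}/gold/{business_line.lower()}/")
```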

Cloud DWH

This is the data model with the central element – the main table – and dimensions around it.

The Preprocessor table of 500 columns is presented in the cloud DWH as a neat table of 10–15 columns. The rest of the columns – the dimensions – are decomposed into rays that describe and complement what happens at the very center. The whole cloud process turns a 500-column table into many convenient, interconnectable tables that you can use like a constructor set – this is what data normalization is all about.
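
If you have never seen this kind of decomposition, here is a toy PySpark version of the idea: carve a dimension out of the wide table and keep only keys and measures in the fact table. The table and column names are invented, and the real model has far more dimensions.

```python
# A toy illustration of normalizing a wide table into a fact table plus a dimension.
# All names are hypothetical; the real DWH model is much richer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lake = "abfss://datalake@storageaccount.dfs.core.windows.net"  # placeholder account
wide = spark.read.parquet(f"{lake}/gold/claims_wide/")  # the "500-column" table

# Dimension: one row per distinct broker, with a surrogate key.
dim_broker = (
    wide.select("broker_name", "broker_country", "broker_segment")
        .dropDuplicates()
        .withColumn("broker_key", F.monotonically_increasing_id())
)

# Fact: measures plus foreign keys pointing at the dimensions.
fact_claims = (
    wide.join(dim_broker, ["broker_name", "broker_country", "broker_segment"])
        .select("claim_id", "broker_key", "claim_amount", "paid_amount", "loss_date")
)

dim_broker.write.mode("overwrite").parquet(f"{lake}/dwh/dim_broker/")
fact_claims.write.mode("overwrite").parquet(f"{lake}/dwh/fact_claims/")
```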

We – the developers – use the data that we prepare ourselves to help the client with the reconciliation, earnings and accruals. Besides us, the data is prepared for a huge base of business consumers.

Business consumers' part

Many teams sit downstream of the Cloud DWH.

  • The company compares the declared data against actual receipts and bank transaction data – the so-called reconciliation. Based on the data from the DWH, they analyze whether all the transactions match the final amounts (see the sketch right after this list).
  • Premiums vs. Claims comparison to see how much profit (or loss) the business is making.
  • OLAP cubes for forecasting and performance analytics. This is no big news, every corporation uses big data for OLAP cubes, and we’re no exception.
  • The federal reporting teams also download the data and transform it as they need, in a very quick and convenient manner.
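
To make the reconciliation point above a bit more concrete, here is a simplified sketch of the kind of comparison it implies: join the declared figures from the DWH to the bank transaction feed and flag the mismatches. The table names, columns and tolerance are illustrative only – the actual reconciliation logic belongs to the client’s teams.

```python
# A simplified picture of reconciliation: declared totals vs. amounts actually received.
# Table names, columns and the 0.01 tolerance are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

declared = spark.table("dwh.fact_premiums")   # what the business reported (assumed table)
bank = spark.table("dwh.bank_transactions")   # what actually hit the account (assumed table)

recon = (
    declared.groupBy("policy_number")
            .agg(F.sum("premium_amount").alias("declared_total"))
    .join(
        bank.groupBy("policy_number").agg(F.sum("amount").alias("received_total")),
        "policy_number",
        "full_outer",
    )
    .withColumn(
        "matched",
        F.abs(
            F.coalesce(F.col("declared_total"), F.lit(0))
            - F.coalesce(F.col("received_total"), F.lit(0))
        ) < 0.01,
    )
)

recon.filter(~F.col("matched")).show()  # the rows someone has to investigate
```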

This is visualization on the one hand and the client’s reporting on the other.

At the end of the day…

In August we go live with the final version of the system, where the Preprocessor does the initial data prepping and the cloud does the rest. From that moment on, all systems and all business users who need it will be able to connect to this server. The Insurance Platform is already generating value, but it will do more – a lot more.

This is just the backend. How the client uses it depends on their teams. Anything is possible, if you know exactly what you want.

Credits

Ivan Sokolov

Business Intelligence Developer

Ivan is an aspiring young leader of the corporate BI universe. He’s constantly challenging himself with new approaches to data processing and experimenting with tools. Ivan lives in Georgia with his wife and two kids.
