How we moved an in-house big data ETL platform to the cloud. Part II

This is why a BSA with domain knowledge is so important.


We’ve built an on-prem big data ETL platform and partly moved it to the cloud for our client. The cloud transition happened after the largest Azure service partner in the world – the third party working for our client – said it could do better in the cloud and initiated the transition itself.

In theory, this would take only a few months and ensure faster data transfer, verification and mapping than an on-prem solution could offer. The clean, stabilized data is then used for reporting and reconciliation for a Fortune 500 carrier. In practice, the process initiated and led by the renowned cloud vendor took more than a year. It resulted in a bulky, though technologically very advanced, solution that will undergo a lot of further improvements in the coming months (if not years).

Hi, I’m Ivan Sokolov, BI and ETL developer at Symfa, and this is the second part of my story about the big data ETL platform that we moved to the cloud.

The first part of the story is here. You’re very much welcome to read it if you’re unfamiliar with the project I’m talking about.

Let’s see now how the cloud part of the Big Data ETL Insurance Platform works and what’s good and bad about it.

Table of Contents

  • What is our data management routine like now that we’re in the cloud?
  • What went wrong with the cloud infrastructure?
  • New project leaders are dissatisfied with the cloud part
  • If it means that many hurdles, why even bother moving to the cloud?
  • How much exactly does this speed cost the client?
  • Challenges: What to expect if you’d like to build the same platform in the cloud
  • How did this huge two-and-a-half-year initiative turn out for the client?
  • At the end of the day…

What is our data management routine like now that we’re in the cloud?

As you can see in the picture below, the source data sits on the left, in the Preprocessor. This is the only on-prem part of the Insurance Platform now. As you move to the right, the data processing steps follow like carriages in a train, ending with the data storage at the tail – all of that is my part of the job, done in the cloud.

It is important to understand that the data we – the cloud team – receive already mapped and verified is backed by the huge work of the on-prem team (Symfa’s part of it alone included 23 people). That team has been partially disbanded, because they did their job very well and the project is coming to an end.

Data flow

The Preprocessor pulls the data from the source folders, maps it, verifies it and stores it on local servers. Then comes the infrastructure prepared by the cloud vendor, which I use to do my part of the job.

But before I could do this, there were a lot of infrastructure hurdles and – apparently – miscommunication issues between the client and the cloud vendor that resulted in delays and wasted resources.

What went wrong with the cloud infrastructure?

Initially, the cloud vendor did a PoC, tested it on one big contractor’s data set and left to build up the muscles of the solution and extend it. Our on-prem solution processes about 350 businesses, with 120 new ones planned to be added to the platform soon. So the cloud vendor had a lot of work to do to turn the PoC into an MVP capable of processing about 500 insurance businesses.

To put it briefly: when they came back with the “ready” solution, it didn’t work.

“No problem, life happens” was the client’s polite reaction, and together – Symfa, the client and the cloud vendor – we spent the autumn and winter of 2023 upgrading, redoing, fixing bugs, improving and bringing the MVP into working condition.

In May we were still making improvements, redoing some parts and testing the uploads.

New project leaders are dissatisfied with the cloud part

It so happens that the cloud vendor has built a very technologically advanced monolith.

The customer has hired new leaders to polish and customize the platform. My new manager has worked in insurance longer than I’ve been alive, or so it seems to me. He says the cloud part… isn’t quite right. What’s happening under the hood makes sense, but it is cumbersome and complicated. That’s how he puts it.

Say we have a pipeline, and sometimes we need to run only a piece of it. Yet for a certain array of data we have to run it end-to-end and spend a lot of capacity on it, while in reality we need only 30 percent of the whole process. This translates into extra costs.

That is, a lot of things need to be redone and half of the solution improved, so that it becomes flexible and easy to use. Symfa is part of that improvement process, too. For me it’s a pleasant challenge, because we are learning a lot from it. The customer probably doesn’t feel the way I do, because the bill is already way too big.

If it means that many hurdles, why even bother moving to the cloud?

It’s the speed of the data processing.

Recently, we tried loading test data to imitate the production phase of the project. Our manager loaded all the business lines in just 12 hours. She ramped the servers up – and will pay a huge bill for that – but she processed Claims and Premiums for six business lines in half a day, something that would previously have taken us a week, probably!

How much exactly does this speed cost the client?

I cannot tell you, unfortunately, as I’m under an NDA. But Azure provides a pretty clear cost analysis, cost alerts – everything you need. Azure is super transparent about telling you that you’re paying a lot.

Azure cost management and billing

The good news is that you’re not paying for nothing. By the end of the project we will have the exact numbers, but it’s already obvious that the cloud can do the job a lot faster than the local servers.

Challenges: What to expect if you’d like to build the same platform in the cloud

  1. When you transfer a solution to the cloud, the most important thing to realize is that what works on local servers definitely cannot be copied to the cloud as-is.
  2. You need to be able to convey business logic, and for that you have to try and find a vendor who understands the domain and – most importantly – your internal processes.

If the vendor doesn’t understand what they are doing, it doesn’t matter how well they are doing it. The client’s engineers saved the day, to be honest, but the losses were huge.

The whole initiative was supposed to take a few months. Here we are in May: the PoC was delivered more than a year ago, and the cloud vendor’s last developer left in March.

How did this huge two-and-a-half-year initiative turn out for the client?

The Preprocessor (99.9% complete)

  • 6 business lines – Warranty, PI, Property, M&C, LE, and A&H
  • 220 million records in the database (and this isn’t the limit)
  • 350+ businesses processed (with 120 more about to be added soon)
  • Reconciliation for the period from 1995 to 2023 – 28 years’ worth of data mapped, verified and brought to order
  • Files in xls, xlsb, csv, txt and xlsm formats, plus Access and SQL databases, converted to a single xlsx format

The Cloud part

I work with the data from the moment it lands in the Landing Zone – the starting point of the cloud section – all the way to the Global Model in the cloud DWH.

Data Lake & Data Factory

The Data Lake serves as the cloud data storage, where data moves from phase to phase: from the bronze layer to the silver one and then the gold one. Each layer is responsible for its own share of the data preprocessing. At some stage an additional source or a piece of business logic is added, and finally the data turns into a model.
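
To give you a rough idea of what one of those layer-to-layer hops looks like, here is a minimal PySpark sketch. The storage paths, column names and cleanup steps are invented for illustration – the real notebooks carry far more logic per layer.

```python
# A minimal sketch of one bronze-to-silver hop in the Data Lake.
# Paths, column names and the cleanup rules are assumptions for illustration,
# not the project's actual code.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lake = "abfss://datalake@storageaccount.dfs.core.windows.net"  # placeholder account
bronze_path = f"{lake}/bronze/claims/"
silver_path = f"{lake}/silver/claims/"

# Bronze: raw data as it arrived from the Preprocessor.
raw = spark.read.parquet(bronze_path)

# Silver: deduplicated, typed, minimally standardized data.
clean = (
    raw.dropDuplicates(["policy_number", "transaction_id"])
       .withColumn("premium_amount", F.col("premium_amount").cast("decimal(18,2)"))
       .withColumn("load_date", F.current_date())
)

clean.write.mode("overwrite").parquet(silver_path)
```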

Data Factory works as an orchestrator. 

To make it clearer: Data Factory is like a pipeline – we go from left to right, copy information, transform it, run pieces of code. We had exactly the same thing on local servers in the Astera Centerprise app. The process is more or less standard and differs only slightly from one ETL tool to another. That is, Data Factory is a tool that lets us configure these pipelines and visually displays how data is cleaned, verified, mapped and so on. On the project we tend to call Data Factory the orchestrator: “You, piece of data #1, go here, and you, piece of data #134, go there.”

Example: What is it like to work with Data Factory

Under the shell of Data Factory there are Synapse notebooks written in PySpark. This is the code that does the work at each of the data preprocessing stages.

Example: What do Synapse notebooks look like? 
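
I can’t show the project’s actual notebooks, so here is a rough sketch of the kind of validation and mapping code that lives inside one of them. The business_line parameter, paths, column names and rules are all made up; in a real Synapse setup, the pipeline’s Notebook activity would pass such parameters into the notebook at run time.

```python
# A rough idea of what one preprocessing notebook does -- not the project's actual code.
# "business_line" stands in for a parameter the pipeline would inject at run time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

business_line = "Warranty"  # assumed parameter, overridden per pipeline run
lake = "abfss://datalake@storageaccount.dfs.core.windows.net"  # placeholder account

silver = spark.read.parquet(f"{lake}/silver/{business_line.lower()}/")

# Verification: reject rows that fail basic sanity checks.
validated = silver.filter(
    F.col("policy_number").isNotNull() & (F.col("premium_amount") >= 0)
)

# Mapping: translate source codes into the target model's vocabulary.
mapped = validated.withColumn(
    "line_of_business",
    F.when(F.col("source_lob") == "WAR", "WARRANTY").otherwise(F.col("source_lob")),
)

mapped.write.mode("overwrite").parquet(f"{lake}/gold/{business_line.lower()}/")
```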

Cloud DWH

This is the data model with the central element – the main table – and dimensions around it.

The Preprocessor table of 500 columns is presented in the cloud DWH as a neat table of 10–15 columns. The rest of the columns – the dimensions – are decomposed into rays that describe and complement what happens at the very center. The whole cloud process turns a 500-column table into many convenient, interconnectable tables that you can use like a constructor set – this is what data normalization is all about.
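
If you have never seen this kind of decomposition, here is a toy PySpark version of the idea: carve a dimension out of the wide table and keep only keys and measures in the fact table. The table and column names are invented, and the real model has far more dimensions.

```python
# A toy illustration of normalizing a wide table into a fact table plus a dimension.
# All names are hypothetical; the real DWH model is much richer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lake = "abfss://datalake@storageaccount.dfs.core.windows.net"  # placeholder account
wide = spark.read.parquet(f"{lake}/gold/claims_wide/")  # the "500-column" table

# Dimension: one row per distinct broker, with a surrogate key.
dim_broker = (
    wide.select("broker_name", "broker_country", "broker_segment")
        .dropDuplicates()
        .withColumn("broker_key", F.monotonically_increasing_id())
)

# Fact: measures plus foreign keys pointing at the dimensions.
fact_claims = (
    wide.join(dim_broker, ["broker_name", "broker_country", "broker_segment"])
        .select("claim_id", "broker_key", "claim_amount", "paid_amount", "loss_date")
)

dim_broker.write.mode("overwrite").parquet(f"{lake}/dwh/dim_broker/")
fact_claims.write.mode("overwrite").parquet(f"{lake}/dwh/fact_claims/")
```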

We – the developers – use the data that we prepare ourselves to help the client with the reconciliation, earnings and accruals. Besides us, the data is prepared for a huge base of business consumers.

Business consumers' part

Many teams sit downstream of the Cloud DWH.

  • The company compares the declared data against actual receipts and bank transaction data – the so-called reconciliation. Based on the data from the DWH, they analyze whether all the transactions match the final amounts (see the sketch right after this list).
  • Premiums vs. Claims comparison to see how much profit (or loss) the business is making.
  • OLAP cubes for forecasting and performance analytics. This is no big news, every corporation uses big data for OLAP cubes, and we’re no exception.
  • The federal reporting teams also download the data and transform it as they need, in a very quick and convenient manner.
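
To make the reconciliation point above a bit more concrete, here is a simplified sketch of the kind of comparison it implies: join the declared figures from the DWH to the bank transaction feed and flag the mismatches. The table names, columns and tolerance are illustrative only – the actual reconciliation logic belongs to the client’s teams.

```python
# A simplified picture of reconciliation: declared totals vs. amounts actually received.
# Table names, columns and the 0.01 tolerance are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

declared = spark.table("dwh.fact_premiums")   # what the business reported (assumed table)
bank = spark.table("dwh.bank_transactions")   # what actually hit the account (assumed table)

recon = (
    declared.groupBy("policy_number")
            .agg(F.sum("premium_amount").alias("declared_total"))
    .join(
        bank.groupBy("policy_number").agg(F.sum("amount").alias("received_total")),
        "policy_number",
        "full_outer",
    )
    .withColumn(
        "matched",
        F.abs(
            F.coalesce(F.col("declared_total"), F.lit(0))
            - F.coalesce(F.col("received_total"), F.lit(0))
        ) < 0.01,
    )
)

recon.filter(~F.col("matched")).show()  # the rows someone has to investigate
```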

This is visualization on the one hand and the client’s reporting on the other.

At the end of the day…

In August we go live with the final version of the system, where the Preprocessor does the initial data prepping and the cloud does the rest. From that moment on, all systems and all business users who need it will be able to connect to this server. The Insurance Platform is already generating value, but it will do more – a lot more.

This is just the backend. How the client uses it depends on their teams. Anything is possible, if you know exactly what you want.

Credits

Ivan Sokolov

Business Intelligence Developer

Ivan is an aspiring young leader of the corporate BI universe. He’s constantly challenging himself with new approaches to data processing and experimenting with tools. Ivan lives in Georgia with his wife and two kids.
