What is data masking, and why should it suddenly matter so much to my business?
So, what is data masking? Here's an example of how it works.
Corporate players hold massive amounts of data, often 10 TB of production data or more. Suppose that during the data analysis stage of a BI reporting project something goes wrong: some figures don’t match or don’t make sense, and the problem has to be fixed. The obvious solution is to take a backup of the production data and dig into the case. It is just as obvious that access to production data should be limited to a restricted circle of people, and a backup is still the same production data, simply stored in another location. So before the development team can look for what went wrong, someone has to mask the data in such a way that it still makes sense to the ETL developers, can be tested properly, and still reproduces the bug.
Thus, data masking is a technique of creating a fake version of your organization's data that has a similar structure but different content. You can use it for purposes such as software testing or user training. The goal is to protect the real data and use the fake one when the real data is not necessary.
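To make "similar structure but different content" concrete, here is a minimal Python sketch of one possible approach. The field names and the `mask_text` and `mask_record` helpers are invented for illustration and are not taken from any particular masking tool. Letters are swapped for random letters and digits for random digits, so lengths, separators and formats survive while the real values do not.

```python
import random
import string


def mask_text(value: str, rng: random.Random) -> str:
    """Replace letters and digits with random ones, keeping length,
    letter case and separators so the value keeps its original format."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as '-' or ' '
    return "".join(out)


def mask_record(record: dict, sensitive: set, rng: random.Random) -> dict:
    """Return a copy of the record with the sensitive string fields masked.
    Non-sensitive fields are passed through unchanged."""
    return {
        key: mask_text(value, rng) if key in sensitive and isinstance(value, str) else value
        for key, value in record.items()
    }


# Hypothetical production row and list of sensitive fields.
row = {"name": "Jane Doe", "ssn": "123-45-6789", "amount": 250.0}
masked_row = mask_record(row, {"name", "ssn"}, random.Random())
print(masked_row)
```

The masked row still looks like a customer record, so test code and training material that expect a name and a nine-digit number in that layout keep working.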
Let’s have a look at the two most common examples of applying data masking in business.
- Say your financial, insurance or banking organization outsources ETL or data cleansing jobs to a third party. Imagine that the same bug we talked about earlier shows up again. To capture it and get rid of it most effectively, you’ll need a fresh copy of data from the live system for testing purposes. With masking, the tests can use real data that reflects the business better, with the most confidential parts masked out. Results are usually more trustworthy when tests run on real data rather than on randomly generated data sets.
- Another use case for data masking relates to API usage (for training or business purposes). With proper data masking in place, you can let users access real-time data from the live system but show them only the fields they are allowed to see based on their role or security level. The remaining fields are replaced with fake values that still make sense (e.g. showing that a field exists, or that a date field's value is close to the real one).
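The role-based API case can be sketched in a few lines of Python. This is only an illustration under assumptions: the roles, field names, `ROLE_VISIBLE_FIELDS` mapping and masking rules below are hypothetical, and a real system would take them from its own security model.

```python
from datetime import date

# Hypothetical mapping of roles to the fields each role may see unmasked.
ROLE_VISIBLE_FIELDS = {
    "auditor": {"customer", "card_number", "paid_on", "amount"},
    "trainee": {"amount"},
}


def mask_value(value):
    """Return a stand-in that hides the value but still makes sense."""
    if isinstance(value, date):
        return value.replace(day=1)  # stays close to the real date
    if isinstance(value, str):
        return "*" * len(value)      # shows the field exists, keeps its length
    return "***"                     # generic placeholder for other types


def view_for_role(record: dict, role: str) -> dict:
    """Build the version of a live record that a given role may see."""
    visible = ROLE_VISIBLE_FIELDS.get(role, set())  # unknown roles see nothing in the clear
    return {key: value if key in visible else mask_value(value)
            for key, value in record.items()}


payment = {
    "customer": "Jane Doe",
    "card_number": "4111111111111111",
    "paid_on": date(2023, 5, 17),
    "amount": 99.5,
}
print(view_for_role(payment, "trainee"))
```

The underlying record is never changed; only the view returned to the caller differs by role.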
Ideally, a team of business and data analysts joins forces with database administrators to mask the data properly. They mask the fields containing confidential data so that the production data can be safely handed over to the development team. This is especially relevant when data jobs are outsourced (i.e. when businesses hand ETL development, ETL testing, data management, data cleansing and other data jobs to third-party vendors).
So, yes, it’s all about safety.
“Why bother applying data masking when I have encryption and firewalls?” Great question, let’s move on to the answers.
How does data masking work, and why isn’t encryption enough for me anymore?
Data masking and encryption are not the same thing. Encrypted data can be unlocked with the right key and restored to its original form. With masked data, the original values cannot be restored. Masking makes a copy of the data that looks real but is fake, and so has no appeal to hackers.
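The difference can be shown with a small Python sketch. Note the assumptions: the XOR transform below is only a toy stand-in for real encryption, used because it is reversible with a key, and the card number and helper names are made up for the example.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy reversible transform (XOR with a random key of equal length).
    A stand-in for real encryption, not something to use in production."""
    return bytes(d ^ k for d, k in zip(data, key))


def mask_digits(value: str) -> str:
    """Irreversible masking: every digit becomes a random digit, and
    nothing is stored that could map the result back to the original."""
    return "".join(
        secrets.choice("0123456789") if ch.isdigit() else ch for ch in value
    )


card = "4111-1111-1111-1111"

# "Encryption": whoever holds the key gets the original value back.
key = secrets.token_bytes(len(card))
ciphertext = xor_bytes(card.encode(), key)
recovered = xor_bytes(ciphertext, key).decode()
print(recovered == card)

# Masking: same format, but there is no key and no way back.
masked = mask_digits(card)
print(masked)
```

Whoever steals the ciphertext and the key gets the real card number. Whoever steals the masked value gets a number that merely looks like one.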
Have a look at the screenshot below to see how payment data changes after masking (original data is on the left). As you can see, data masking is not applied to all the fields, but only to those that may have potential value for intruders.
What you need to mask depends on your situation. In one case, this would be PII columns only, that is, full name, social security number, passport number, home address, date of birth and other items of Personally Identifiable Information. In another case, this may not be enough, so you’ll also need to mask financial data, as in the screenshot above.
After all, what you must mask should be defined by business and data analysts, not the development team, and come from your unique business situation.
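One way to record that decision so developers can apply it without making it themselves is a named masking profile. The Python sketch below is illustrative only: the profile names, the column names and the `columns_to_mask` helper are assumptions, and the financial columns in particular are placeholders for whatever your analysts define.

```python
# PII columns named in the text above; financial columns are hypothetical examples.
PII_COLUMNS = {"full_name", "ssn", "passport_number", "home_address", "date_of_birth"}
FINANCIAL_COLUMNS = {"card_number", "account_balance", "payment_amount"}

# Profiles are owned by business and data analysts, not by the development team.
MASKING_PROFILES = {
    "pii_only": PII_COLUMNS,
    "pii_and_financial": PII_COLUMNS | FINANCIAL_COLUMNS,
}


def columns_to_mask(table_columns: list, profile: str) -> list:
    """Return, in table order, the columns the chosen profile marks as sensitive."""
    sensitive = MASKING_PROFILES[profile]
    return [column for column in table_columns if column in sensitive]


payments_table = ["id", "full_name", "ssn", "card_number", "payment_amount", "created_at"]
print(columns_to_mask(payments_table, "pii_only"))
print(columns_to_mask(payments_table, "pii_and_financial"))
```

Keeping the profiles in one place means a change in business requirements is a change to one mapping rather than to every masking job.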
Aren’t firewalls protecting my data? I was thinking of getting more of them.
In a company with 10,000 to 12,000 employees, thousands will have access to the production database. They may include business analysts, developers, project managers and even Jenny from accounting. You may try to secure your data with firewalls, but that will only lead you further away from the real problem.
What is a firewall? It’s a network security device that inspects incoming and outgoing network traffic. Guided by the corporate security rules, it accepts or rejects data packets entering and leaving your local network. The term originally comes from fire safety, where a firewall is a fire-resistant wall that separates parts of a building to prevent the spread of fire. A network firewall is designed as a similar barrier between your internal network and traffic from external sources (such as the internet), blocking malicious traffic such as viruses and hacking attempts.
So, does a firewall protect your data from hacking? Yes. But nobody is going to try to hack you if they can find your developer’s profile on LinkedIn, buy him or her a beer and discuss the price at which the developer is ready to hand your data over. As simple as that.
Okay, I’m sold. How do you solve this problem and how fast can you do this?
For a corporation with a long history and dozens of legacy databases, it may take years. “Why so dramatic?” you may ask. Let’s see.
Say you have around 50-60 databases, or even more. Around 3,000 to 5,000 employees have regular access to those databases. The databases are huge, as they contain thousands of rows of historical data. If such data needs to be masked, you need experts who can read it, understand it and mask it in such a way that the data still makes sense afterwards. Such experts are hardly plentiful on the market, nor are they sitting around waiting for a data masking job in your company. All too often, no one really knows what is going on in those legacy databases because of high staff turnover in big companies. As for developers, in 99% of cases engineers are too scared to even touch those databases, let alone change a thing about them.
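In short, the key blockers that keep you from tackling the data masking problem fast are the ones just described: a shortage of experts who can mask data meaningfully, lost knowledge of what the legacy databases contain, and engineers who are reluctant to touch them.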
The verdict: you cannot solve the data masking issue that fast. The good news is that you can start moving in the right direction right now. Even better, as technology advances (and it just won’t stand still), this process should naturally pick up speed.
Who do we need and how much will it cost us to start data masking in our company?
First things first, you’re going to need data analysts who can see the data for what it is and collect high-quality feedback from the developers on how they use it.
Second, you’ll need the tools. Data masking is nothing new. Not so long ago, and in many places still today, it was a partially manual job: the business unit decides which data should be masked, and the development unit implements the job, however long and inefficient the process may be. In 2023, you would definitely want to automate this. Why? Let’s see.
One of our corporate clients has their backups done once a week. The backup process takes three days due to the volume of the data. Masking takes about as long, so without dedicated data masking automation tools the whole process takes twice as long and will probably cost you twice the original estimate.
Thus, the problem comes down to talent, wages and tools. Wages and tools are, in a nutshell, solvable problems; the talent part persists. Those people must be very sharp (and I mean it), quick learners or, as a rule, data scientists with a specialized background and engineering skills.
What are the compliance risks I face if I skip data masking?
Not ready to go with data masking yet? Wait-and-see mode is common among corporate businesses, which have too much on their plate. Still, let’s take a brief look at the real risks that may cause you real losses.
If anything, it’s close to impossible to predict future changes in legislation. To be on the safe side, let’s assume that the penalties for data breaches will rise and liabilities will increase. For now, if you’re a US-based company, you’ll have to bear in mind at least four standards and regulations you risk violating by skipping the data masking part.
- California Consumer Privacy Act (CCPA)
- General Data Protection Regulation (GDPR)
- Health Insurance Portability and Accountability Act (HIPAA)
- Payment Card Industry Data Security Standard (PCI DSS).
To put numbers on it: in 2021, Luxembourg officials fined Amazon $877 million for violating the GDPR, according to the company's financial records. In 2022, Instagram was fined $403 million by the DPC in Ireland for violating the GDPR's children's privacy provisions. Yes, big players can get away with paying a fine and a couple of unfavorable mentions in the media. However, 60% of smaller businesses go out of the game within 6 months of being hacked because of financial and reputational damage. After all, money isn’t everything, not for everyone.
Concluding lines: What do you propose to solve the data masking issue for my business?
Before elaborating on your unique data masking strategy, there are three important things you should do.
- Do research on data masking tools (or ask your vendor to do it for you). If you’re a corporate player, chances are you’ll be sticking to the Microsoft family of products, in which case note that Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics all support dynamic data masking. AWS DMS is an option, too, for those whose choice of tools is not restricted to the Microsoft ecosystem.
- Start looking for the first candidates among your projects. Avoid legacy databases that even your most hardcore engineers are scared to look at. Give it a fresh start with a newer project, where you can find knowledgeable experts ready to work with it. This, in turn, will help you overcome another issue.
- Talents. This one can be a challenge. Outsourcing is a way out, yet make sure your vendor has experience working with massive amounts of sensitive data (look for finance, insurance or banking-focused vendors).
That’s it for today as far as data masking is concerned. Nowadays, data security solutions abound and data masking can become another valuable technique in your data security management strategy.
But before you go, I’d like to double-check one little thing. You do remember that data leakage has nothing to do with firewalls? You may have a whole firewall team building a firewall on top of a firewall, and it still won’t change a thing about your stolen data. The difference between hacking and data leakage is that hacking is hard. Who would want to do it the hard way if there’s a chance you haven’t bothered to mask your data at all? Gives you something to think about, doesn't it?