The data masking component should support masking of data through pseudonymisation and anonymisation. The purpose of data masking is to protect the actual data while providing a functional substitute that can be used where real data is not required. Furthermore, masked data should remain operational, so that it can still be used for purposes such as testing or machine learning, and dynamic masking based on user rights should be possible.
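As an illustration only, the three capabilities named above (pseudonymisation, anonymisation and dynamic masking based on user rights) could be sketched as follows. This is a minimal sketch and not a required design: keyed hashing is just one possible pseudonymisation technique, and the key, the role names, the column names and the function names are all hypothetical assumptions that do not come from the attached requirements.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would be held in a key store.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical policy: which masking mode each user role receives.
ROLE_POLICY = {
    "data_owner": "clear",
    "analyst": "pseudonymise",
    "tester": "anonymise",
}

# Hypothetical set of columns treated as sensitive.
SENSITIVE_COLUMNS = {"name", "email"}


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed token: the same input always gives the same token,
    so joins and grouping on the masked column still work."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def anonymise(value: str) -> str:
    """Suppress the value entirely; nothing of the original is retained."""
    return "*" * 8


def mask_record(record: dict, role: str) -> dict:
    """Apply masking to the sensitive columns according to the caller's role.
    Unknown roles fall back to the most restrictive mode (anonymise)."""
    mode = ROLE_POLICY.get(role, "anonymise")
    masked = {}
    for column, value in record.items():
        if column not in SENSITIVE_COLUMNS or mode == "clear":
            masked[column] = value
        elif mode == "pseudonymise":
            masked[column] = pseudonymise(str(value))
        else:
            masked[column] = anonymise(str(value))
    return masked
```

In this sketch, calling `mask_record({"name": "Ada", "age": 36}, "analyst")` would leave `age` unchanged and replace `name` with a 16-character token, whereas the same call with the role `"tester"` would replace `name` with a fixed placeholder. Defaulting unknown roles to the most restrictive mode is a design choice made for the example.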
As shown in the attached document ‘070520 — External dialog meeting regarding data masking’, on the page ‘Data flows for the data platform’, data is ingested from various new and legacy systems into a data staging area, from where it is transferred into the data lake or data warehouse of the core data layer. The transfer can be carried out using an ETL process. Analysts, developers and data scientists can access data in the core data layer through workspaces, each of which has its own isolated data area (data store) on the data lake. Data access from the workspaces to the data warehouse is expected to be handled via an ETL process. Furthermore, the data warehouse can be accessed directly from Power BI. We expect that data masking will be required:
— when transferring data from the staging area to the core data layer, directly or in connection with an ETL process;
— when data is transferred from the data lake to the workspaces;
— when data is transferred from the data warehouse to the workspaces in connection with an ETL process;
— when data in the data warehouse is accessed via Power BI;
— when data is transferred to test and pre-production environments.
In the attached file ‘070520_Data Masking’, we have identified initial requirements that the data masking component should support. They include both data masking-specific and cross-functional requirements. These initial requirements are not a complete list, and they are open for discussion as to whether the vendor considers them feasible and would recommend them for a data masking component.
Where relevant, the data masking component should integrate with the technologies already in use on the data platform, including:
— Azure Data Lake,
— Azure Synapse Analytics (data warehouse),
— Azure SQL Database,
— Azure Data Factory,
— Azure Automation,
— Azure Pipelines,
— Azure Active Directory,
— Azure Databricks,
— Azure Data Factory Data Flow,
— Azure Machine Learning,
— Azure Container Registry,
— Azure Key Vault,
— Microsoft Power BI,
— Azure Container Instance.