
Scalable Data Management in the Cloud by Jacques Conradie

Jacques is an expert in the field of data management and analytics, with a proven track record of leading technology change in the financial services and oil and gas industries. He’s Principal Consultant (Data & Analytics) at CGI, and also works for a large Dutch financial services company. In his role as Product Manager, Jacques leads two teams responsible for improving data reliability within the Global Data Platform.
In this post, Jacques discusses the increasing need for effective data management in the financial services industry. He advocates for adopting data mesh principles, which create secure and scalable data management practices within complex organisations. Jacques offers his insights into features of the Global Data Platform, a key example of data mesh principles in practice:

THE INCREASING IMPORTANCE OF DATA MANAGEMENT FOR FINANCIAL SERVICES

If a bank loses the trust of its clients, or takes too many risks, it can collapse.

Most of us recall the financial crisis that began in 2007 and its horrendous aftermath, including the collapse of Lehman Brothers (an American investment bank) and the sub-prime mortgage meltdown. As a direct result of these events, the Basel Committee on Banking Supervision (BCBS) published several principles for effective risk data aggregation and risk reporting (RDARR).

Today, these principles are shaping data management practices all over the world, and they describe how financial organisations can achieve a solid data foundation with IT support. The BCBS 239 standard was the first document to precisely define data management practices around the implementation, management and reporting of data.

THE GLOBAL DATA PLATFORM: A GOVERNED DATA MARKETPLACE

Historically, every use case involving data required a tailored and singular solution. This model for data sharing wasn’t ideal as it provided little to no re-use for other data use cases.

From a data management perspective, business areas were expected to implement tooling across many different data environments. This was sub-optimal as it required time and effort from producers of data to connect their systems to instances of Informatica Data Quality (IDQ).

All these pain points resulted in a slow time-to-market for data use cases and therefore posed a significant threat to the organisation’s future data-driven ambitions.

This triggered a series of questions about how data could be shared, governed and re-used at scale across the organisation.

The end goal was simple: a single “platform of truth” aimed at empowering the organisation to govern and use data at scale. To that end, the Global Data Platform (GDP) was launched.

Today, the GDP is a one-stop (governed) data marketplace for producers and consumers across the globe to exchange cloud data in a manner that is easy, safe and secure. This flexible and scalable data platform was implemented using a variety of Microsoft Azure PaaS services (including Azure Data Lake Storage and Data Factory).

TRANSACTION MONITORING: A CONSUMER USE CASE

As part of customer due diligence (CDD) within financial services, organisations are often expected to monitor customer transactions. A team of data scientists would typically build, deploy and productionise a model that uses a data platform as its source. As an output, the model generates alerts that act as early warning signals for CDD analysts. Without complete, high-quality data, the model could generate faulty alerts or, even worse, completely fail to detect certain transactions in the first place. From a BCBS 239 perspective (described earlier), banks are expected to demonstrate a certain level of control over critical data and models. Failure to do so could result in hefty penalties and reputational damage.
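
To make the dependency on data quality concrete, here is a minimal, illustrative Python sketch (not the GDP’s actual implementation; the field names and threshold are assumptions) of how a transaction-monitoring pipeline might gate model scoring on the completeness of critical input fields, so that incomplete data raises an incident instead of producing unreliable alerts:

```python
# Illustrative sketch only: gate model scoring on the completeness of
# critical input fields, so incomplete data raises a data-quality incident
# instead of silently generating unreliable alerts.
import pandas as pd

# Hypothetical critical fields the monitoring model depends on.
CRITICAL_FIELDS = ["customer_id", "amount", "currency", "counterparty_country"]
COMPLETENESS_THRESHOLD = 0.98  # assumed tolerance, for illustration only


def completeness(df: pd.DataFrame, column: str) -> float:
    """Share of rows with a non-null value in the given column."""
    return 1.0 - df[column].isna().mean()


def score_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the monitoring model; flags large transactions."""
    alerts = df[df["amount"] > 10_000].copy()
    alerts["alert_reason"] = "amount_above_threshold"
    return alerts


def run_monitoring(df: pd.DataFrame) -> pd.DataFrame:
    # Check completeness of critical fields before scoring.
    failed = {
        col: round(completeness(df, col), 3)
        for col in CRITICAL_FIELDS
        if completeness(df, col) < COMPLETENESS_THRESHOLD
    }
    if failed:
        # In practice this would raise an incident for the producing domain
        # rather than letting faulty alerts reach CDD analysts.
        raise ValueError(f"Input data failed completeness checks: {failed}")
    return score_transactions(df)
```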

This is one of many use cases within the context of financial services, and truly highlights the importance of having governed data where data quality is controlled and monitored.

By adopting data mesh principles, the GDP has been able to successfully deliver trusted data into the hands of many consumers.

DATA MESH FOR DATA MANAGEMENT

Although initially introduced in 2019 by Zhamak Dehghani, Data Mesh remains a hot topic in many architectural discussions today. The underlying architecture stretches beyond the limits of a (traditional) single platform and implementation team, and is useful for implementing enterprise-scale data platforms within large, complex organisations.

The associated principles have served as inspiration ever since the beginning of the GDP journey and provided a framework for successfully scaling many platform services including data management.

The following principles form the backbone of the data mesh framework, and each of them was carefully considered when the GDP initially geared up for scaling:

  • Data domains & ownership
  • Data-as-a-product
  • Self-service platforms
  • Federated governance

For each principle, a brief overview will be provided, followed by an example of the principle in practice.

DATA DOMAINS & OWNERSHIP

Modern data landscapes are complex and extremely large. To navigate this successfully, it is recommended to segregate enterprise data into logical subject areas (or data domains if you will). It is equally important to link ownership to every one of these domains. This is important as data mesh relies on domain-oriented ownership.

Within the context of the GDP, data domains were established around the various business areas and organisational value chains (example below):

Retail Business

  • Customer
  • Payments
  • Savings
  • Investments
  • Lending
  • Insurance

Wholesale & Rural Business

  • Customer
  • Payments
  • Financial Markets
  • Lending

Within the Retail Business, for example, there exists a Customer Tribe with several responsibilities. These responsibilities typically span from core data operations (like data ingestion) all the way to data management-related operations (like improving data quality). This model ensures that each domain takes responsibility for the data delivered.
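
As a purely illustrative sketch (the domain and tribe names mirror the examples above; the contact details are hypothetical), domain ownership can be captured as machine-readable metadata that a platform uses to route data quality incidents and access requests to the accountable team:

```python
# Illustrative only: recording domain-oriented ownership as machine-readable
# metadata. Domain and tribe names mirror the examples above; the contact
# details are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class DataDomain:
    business_area: str    # e.g. "Retail Business"
    domain: str           # e.g. "Customer"
    owning_team: str      # the tribe accountable for the data it delivers
    steward_contact: str  # stewardship embedded within the delivery team


DOMAINS = [
    DataDomain("Retail Business", "Customer", "Customer Tribe", "customer-tribe@example.com"),
    DataDomain("Retail Business", "Payments", "Payments Tribe", "payments-tribe@example.com"),
    DataDomain("Wholesale & Rural Business", "Financial Markets",
               "Financial Markets Tribe", "fm-tribe@example.com"),
]

# A platform can use such a registry to route data-quality incidents,
# access requests and lineage questions to the accountable domain team.
owner_by_domain = {(d.business_area, d.domain): d.owning_team for d in DOMAINS}
```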

To support data management-related operations, many domains appoint so-called data stewards, because data governance is still being treated as something separate and independent from core data operations. However, it is not feasible to increase headcount in proportion to the vast amounts of data that organisations produce today. Instead, data stewardship should be embedded within those teams who are building and delivering the data.

DATA-AS-A-PRODUCT

Data-as-a-Product is another important data mesh principle and challenges the traditional perspective of separating product thinking from the data itself. It describes how this change in perspective can truly impact the way we collect, serve and manage data. By merging product thinking with the data itself, we start treating data consumers as customers, and we try our utmost to provide our customers with experiences that delight.

In his book Inspired, Marty Cagan emphasises three important characteristics behind successful technology products that customers love. They are (in no particular order):

  • Valuable: customers choose to use or buy the product
  • Usable: users can actually figure out how to use it
  • Feasible: it can be built with the time, skills and technology available

When building and releasing data products, the various domains are expected to adopt a similar way of thinking. From a platform perspective, it is recommended to always deliver something compelling to use.

Within the GDP, this was realised by offering platform services that are easy to use, transparent, and built on shared responsibility (Bottcher, 2018).

For example, when the GDP initially released a solution for data quality monitoring and reporting, we asked ourselves:

  • Is our service intuitive to use?
  • Do our users gain actionable insights from the service we deliver (in this case, DQ monitoring)?
  • What is the scope of user vs. platform responsibility? Are both parties accepting responsibilities?

To successfully manage data-as-a-product independently and securely, data should be treated as a valuable product, and responsible domains should strive to deliver data products of the highest grade.
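
One hedged way to picture this product mindset is a simple “data product descriptor” that makes ownership, refresh cadence and quality expectations explicit for consumers. The field names below are illustrative assumptions, not the GDP’s actual metadata model:

```python
# A hypothetical "data product descriptor": treating a dataset as a product
# means publishing who owns it, what it promises consumers, and how its
# quality is monitored. Field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str
    description: str
    refresh_frequency: str                            # e.g. "daily"
    quality_checks: list[str] = field(default_factory=list)
    consumers: list[str] = field(default_factory=list)


customer_payments = DataProduct(
    name="retail_customer_payments",
    domain="Retail Business / Payments",
    owner="Payments Tribe",
    description="Cleared retail payment transactions, enriched with customer keys.",
    refresh_frequency="daily",
    quality_checks=["completeness(customer_id) >= 0.99", "freshness <= 24h"],
    consumers=["Transaction Monitoring", "CDD Analytics"],
)
```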

If we treat data as a by-product, instead of treating data-as-a-product, we will fail to prioritise much-needed data management and risk losing consumer trust. Without consumers, do we really have (data) products? And without (data) products, do we truly have a business?

SELF-SERVICE PLATFORMS

It was highlighted earlier how traditional data architecture often involved a single platform and implementation team. Data mesh draws a clear distinction between the following:

  • A platform team that focuses on providing domain teams with technical capabilities and the infrastructure they need in each domain
  • Data domains that focus on individual use cases by building and delivering data products with long-term value

In other words, data mesh platforms should enable the different data domains to build and deliver products of the highest grade completely autonomously. As an outcome, self-service will be enabled with limited, if any, involvement needed from the central platform team.

Sadly, this model is not always reflected in practice. When considering data quality (DQ) management, for example, it is clear that a traditional approach is still prevalent in many organisations today. This approach involves intensive code-based implementations where most DevOps activities are taken care of by central IT. The result could be a turnaround of 1 day (at best) to build, test and deploy a single DQ rule.

A practical example of self-service in action is the GDP’s so-called DQ Rule Builder application.

This front-end application promotes accelerated DQ monitoring and caters for a variety of DQ rules via a user-friendly interface (developed using Microsoft Power Apps). The end-to-end solution gathers user requirements and intelligently converts these into productionised DQ rule logic. This approach has automated many parts of the traditional build/deploy process and resulted in record-level turnaround times for the organisation. As an added benefit, both IT and business were empowered and platform users could essentially start serving themselves.
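
The sketch below illustrates the general idea behind such a rule builder, assuming a hypothetical declarative rule payload and generated SQL; it is not the GDP’s actual implementation. A front-end form captures the rule definition, and the platform translates it into executable DQ logic without a hand-written deployment per rule:

```python
# Simplified sketch of a self-service rule builder: a user submits a
# declarative rule definition and the platform turns it into executable DQ
# logic, with no hand-written deployment per rule. Rule types and the
# generated SQL are assumptions, not the GDP's actual implementation.

RULE_TEMPLATES = {
    # Share of non-null values in a column.
    "completeness": (
        "SELECT COUNT({column}) * 1.0 / COUNT(*) AS score FROM {table}"
    ),
    # Share of rows whose value falls inside an allowed range.
    "range": (
        "SELECT AVG(CASE WHEN {column} BETWEEN {min} AND {max} "
        "THEN 1.0 ELSE 0.0 END) AS score FROM {table}"
    ),
}


def build_dq_query(rule: dict) -> str:
    """Turn a declarative rule (as captured by a front-end form) into SQL."""
    template = RULE_TEMPLATES[rule["type"]]
    return template.format(**rule)


# Example: the kind of payload a front-end form might submit.
rule = {"type": "completeness", "table": "retail.payments", "column": "customer_id"}
print(build_dq_query(rule))
# -> SELECT COUNT(customer_id) * 1.0 / COUNT(*) AS score FROM retail.payments
```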

FEDERATED GOVERNANCE

Self-service without governance can quickly turn into chaos, and therefore the final pillar of data mesh is federated data governance. Data mesh relies on a federated governance model where, for example, capabilities for data management are owned centrally (usually by the platform team) and utilised in a decentralised manner (by the cross-functional domain teams).
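
A hedged sketch of what this can look like in code (module and function names are hypothetical): the platform team owns and maintains a shared data quality check, whilst a domain team calls it from its own pipeline and chooses its own threshold, keeping execution decentralised while the capability remains centrally owned:

```python
# Hypothetical sketch of federated governance in code. Names are illustrative,
# not an actual GDP library.

# --- centrally owned capability (e.g. published as an internal package) ----
def check_not_null(rows: list[dict], column: str) -> dict:
    """Centrally maintained check: share of rows where `column` is present."""
    non_null = sum(1 for r in rows if r.get(column) is not None)
    score = non_null / len(rows) if rows else 0.0
    return {"check": "not_null", "column": column, "score": score}


# --- domain-owned pipeline (runs inside, say, the Payments domain) ---------
def payments_pipeline(rows: list[dict]) -> list[dict]:
    result = check_not_null(rows, "customer_id")
    if result["score"] < 0.99:  # threshold chosen by the domain, not centrally
        raise ValueError(f"DQ check failed: {result}")
    return rows
```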

What we have noticed in the on-premise world is that data management capabilities used to be separate and far away from the data itself. Thankfully, the public cloud has enabled us to embed these capabilities much closer to the data.

Essentially, core platform services have been extended with data governance (DG) capabilities, which makes it possible for users to access services for data quality, data lineage and more, completely out of the proverbial box.

When studying data mesh articles, the word “interoperability” pops up quite frequently. This is defined as “the ability of different systems, applications or products to connect and communicate in a coordinated way, without effort from the end user”. Therefore, DG services should be designed to seamlessly integrate with existing and future data products.

CONCLUSION: TIPPING THE SCALE TOWARDS SCALABLE DATA MANAGEMENT IN THE CLOUD

How are you and your organisation viewing data management? Is it regarded as something controlling, slow and limiting? Or is it frictionless and adaptable whilst at the same time democratised?

In one of his articles, Evan Bottcher shares that truly effective platforms must be “compelling to use”. In other words, it should be easier to consume the platform’s capabilities than to build and maintain your own (Bottcher, 2018).

The story of the Global Data Platform (GDP) illustrates how organisations can effectively tip the scale towards scalable data management in the cloud by creating a compelling force driven by data mesh principles.