
EVOLUTION OF DATA PLATFORMS: FRANCESCO GADALETA


Francesco Gadaleta PhD is Senior Software Engineer, Chief Data Scientist and Founder of Amethix Technologies. He hosts the popular Data Science at Home podcast, and he’s held key roles in the healthcare, energy, and finance domains. Francesco’s specialisms include advanced machine learning, computer programming, cryptography, and blockchain technology.

In this post, Francesco reflects on the changing role of the data scientist in an organisation. Data platforms have evolved, and will continue to do so at pace, but how can data scientists help organisations adapt to take advantage of these ongoing advancements? As Francesco explains, the solution isn’t always as simple as integrating the latest platform.

The work of the Data Scientist has changed in recent years. 

While it has improved for some, it’s probably degraded and become even more frustrating for others. For example, several of the tasks that Data Scientists performed a decade or so ago are no longer needed because there are better tools, better strategies, better platforms – and definitely better architectures.

Think about Oracle, PostgreSQL, MS SQL Server, and many others. Over the past couple of decades, these platforms have been the most important systems in organisations of pretty much any sector, regardless of size and data type. With an increased demand for analytical tasks coming from key decision-makers in companies and organisations, engineers are turning to different types of architectures and platforms to fulfil these new requirements. These architectures range from the data warehouse to the data lake, the data fabric, the lakehouse and the data mesh. Chronologically, the data warehouse has evolved into the data mesh and, later, the data lakehouse. The lakehouse is one of the most recent evolutions of data architecture.

This new concept, the lakehouse, overcomes some of the most evident limitations of databases for both structured and unstructured data. Setting aside the fact that key-value stores can, to a certain extent, manage unstructured data, the idea of the database is built on transactions: some form of consistency, plus a schema imposed on the data. This means that once you have an ingestion layer that takes data from a data source into the database, the engineer has to know in advance what that data looks like. For some organisations, this is still a valid approach. The fact that data architectures evolve doesn’t necessarily mean that one has to migrate to the most novel approach overnight.
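To make the schema-on-write idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table, columns and values are purely illustrative, but they show how a classic database rejects data that doesn’t match the schema declared up front:

```python
import sqlite3

# Schema-on-write: the table structure must be declared before any data
# can be ingested (table and column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        sensor_id   TEXT NOT NULL,
        recorded_at TEXT NOT NULL,
        value       REAL NOT NULL
    )
""")

# A row that matches the declared schema is accepted...
conn.execute(
    "INSERT INTO readings VALUES (?, ?, ?)",
    ("sensor-42", "2024-01-01T00:00:00", 21.5),
)

# ...while a record with an unexpected shape is rejected at write time.
try:
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        ("sensor-42", "2024-01-01T00:00:00", 21.5, "extra-field"),
    )
except sqlite3.OperationalError as exc:
    print(f"Rejected by the schema: {exc}")
```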

There are many organisations whose business (and the type of data they manage) hasn’t changed in a decade or more. The main reason data lakes were introduced is to overcome some of the technical limitations of old-school databases. For example, the fact that databases and data marts can neither store nor process semi-structured data.

The concept of the data lake has its own benefits and limitations, like any other data architecture. Probably its key benefit is that it is based on object storage technology, which is usually low cost.
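As a rough sketch of why the lake is attractive, landing data can be as simple as dropping columnar files into an object store. This assumes pandas with pyarrow and s3fs installed, and the bucket name is made up; nothing at this layer validates the data:

```python
import pandas as pd

# A hypothetical landing path in an object store (bucket and prefix are invented);
# pandas hands the upload off to fsspec/s3fs when given an s3:// URL.
landing_path = "s3://example-raw-zone/sales/2024/01/orders.parquet"

orders = pd.DataFrame(
    {
        "order_id": [1001, 1002],
        "customer": ["acme", "globex"],
        "amount": [250.0, 99.9],
    }
)

# Files are simply dropped into cheap object storage as Parquet; no schema is
# enforced and no transaction coordinates concurrent writers.
orders.to_parquet(landing_path, index=False)
```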

Historically, proprietary data warehouses have been quite expensive pieces of technology where most of the costs have been due to maintenance performed by expensive consultants provided by the supplier. When it comes to proprietary data and proprietary data formats, there is yet another threat to keep an eye on, which is data lock-in.

Of course, all that glitters is not gold. Data lakes come with some challenges. For example, no support for transactional applications, meaning low reliability and potentially poor data quality.

When you incentivise someone to put their data into one place as low-cost object storage, there is no schema enforcement and no concept of transactions, so no consistency at all (or, at best, what is usually referred to as eventual consistency in distributed computing). Essentially, this means much lower reliability with respect to the old-school database. As time goes on, data quality starts degrading, and that can become one of the most difficult challenges one is called to deal with. One of the best evolutions from the data lake that overcomes such limitations and definitely improves data quality is the data mesh.
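Before moving on to the mesh, here is a minimal sketch of the kind of quality gate that the lack of schema enforcement forces lake users to bolt on by hand; the expected columns and the sample batch are invented for illustration:

```python
import pandas as pd

# The columns and dtypes we expect in the landing zone: an illustrative contract
# that plain object storage will never enforce for us.
EXPECTED_COLUMNS = {"order_id": "int64", "customer": "object", "amount": "float64"}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of problems instead of silently landing bad data."""
    problems = []
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    for column, dtype in EXPECTED_COLUMNS.items():
        if column in df.columns and str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    if "order_id" in df.columns and df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    return problems

# A malformed batch: the missing column and the duplicated keys are reported.
batch = pd.DataFrame({"order_id": [1001, 1001], "customer": ["acme", "acme"]})
print(validate_batch(batch))
```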

The data mesh is, in my opinion, one of the most elegant solutions for evolving from the data warehouse to the data lake – and then to something even better than that. The concepts behind the data mesh are not necessarily new per se. At its core, the data mesh requires a domain expert who is responsible for a particular business unit and the data related to it.

This new role, in a sort of decentralised fashion, owns the data and treats it as a data product. There is no longer a centralised way of storing raw transactional data. There is, instead, a way of decentralising the ownership of the data.

There is access control and global governance. There is decentralised ownership. There is the concept of the domain data product. I do believe that many organisations will migrate to the data mesh in the near future. Not only am I a big fan of the concept, it also seems to be the most natural solution to the problems of today and tomorrow. However, there is no standard or product that implements a data mesh. Data mesh is a concept, not a product. While there will be several implementations of the data mesh, there is currently no single product that one can purchase anywhere. There are principles that describe best practices for putting a data mesh together for a business and leveraging it for a particular use case.
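Since the data mesh is a set of principles rather than a product, the closest thing to a code example is the contract a domain team publishes for its data product. The sketch below is purely illustrative; the class name, fields and values are assumptions, not part of any standard:

```python
from dataclasses import dataclass, field

# An illustrative "domain data product" contract: decentralised ownership,
# a published schema, access control and a governance-style freshness SLA.
@dataclass
class DataProduct:
    name: str                      # e.g. "orders-cleaned"
    domain: str                    # the business unit that owns the data
    owner: str                     # accountable domain team or person
    output_port: str               # where consumers read it (table, topic, path)
    schema: dict[str, str]         # published column contract
    allowed_consumers: list[str] = field(default_factory=list)  # access control
    sla_freshness_hours: int = 24  # governance: how stale the data may get

orders_product = DataProduct(
    name="orders-cleaned",
    domain="sales",
    owner="sales-data-team",
    output_port="s3://example-sales-domain/orders_cleaned/",
    schema={"order_id": "int64", "customer": "string", "amount": "float64"},
    allowed_consumers=["finance", "marketing-analytics"],
)
```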

Another architecture trying to solve similar problems is the data fabric. The data fabric is characterised by centralised data, it is metadata-driven, and it takes a technology-centric approach. It is usually built on three essential concepts: compliance, governance, and privacy. Looking at the architecture of the data fabric, there are consumers on one side, with the entire data lifecycle management layer just below. Below that sit the components that implement metadata management, data catalogues, ETL, data visualisation, and so on. At the lowest layer sits the actual infrastructure that stores the raw data.
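There is likewise no single data fabric product or API, but the metadata-driven idea can be sketched as a small catalogue that compliance, governance and privacy rules hang off. Everything below is invented for illustration; a real fabric is assembled from catalogue, lineage and policy tooling rather than a dataclass:

```python
from dataclasses import dataclass

# A toy, metadata-driven catalogue entry, loosely mirroring the layers described
# above: consumers discover datasets by name, while the storage location points
# at the raw-data infrastructure at the bottom of the stack.
@dataclass
class CatalogEntry:
    dataset: str            # logical name exposed to consumers
    storage_location: str   # lowest layer: where the raw data physically sits
    file_format: str        # e.g. "parquet", "csv"
    contains_pii: bool      # privacy flag used by governance policies
    retention_days: int     # compliance rule attached as metadata
    lineage: list[str]      # upstream datasets this one is derived from

catalog: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    """ETL and ingestion jobs register what they produce, so discovery, access
    and governance decisions can be driven from metadata alone."""
    catalog[entry.dataset] = entry

register(CatalogEntry(
    dataset="customer_orders",
    storage_location="s3://example-raw-zone/sales/orders/",
    file_format="parquet",
    contains_pii=True,
    retention_days=365,
    lineage=["crm_exports", "web_checkout_events"],
))
```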

Despite the constant evolution of platforms and architectures, there’s no need to adopt them overnight. It goes without saying that there’s no winner and no loser. There’s just the approach and the architecture that best suit a particular business.

My take-home message is to pay much more attention to the organisational needs, the analytical workloads, the variety of the data and the data analytics maturity of the entire organisation, rather than committing to the latest technology.
