DATA PLATFORMS – HOW CAN THE RIGHT DATA PLATFORM STREAMLINE YOUR ORGANISATION? TARUSH AGGARWAL

Tarush Aggarwal is one of the world’s leading experts in helping organisations leverage data for exponential growth. He holds a degree in Computer Engineering from Carnegie Mellon University, and he became the first Data Engineer on the analytics team at Salesforce. Tarush also led the data function for WeWork before founding the 5x Company in 2020, which supports entrepreneurs in scaling their businesses.

We asked Tarush to share his expert knowledge on how organisations can meet the demands of managing their disparate data sources. Tarush explains the benefits of implementing a scalable, managed data platform that will evolve alongside your organisation.

Stay tuned as we delve into Tarush’s inspiring journey and learn valuable insights from his experience in the ever-evolving world of data science.

How has the data infrastructure landscape developed over the last 10 years, Tarush?

When you look at the history of data infrastructure, it began with the online revolution. All of a sudden, we went from storing data on our own personal devices to storing data on the Cloud. With the advent of Facebook and Google, Cloud companies started collecting massive amounts of customer information, so the need to analyse this information is really where the big data revolution came from.

“It’s becoming more and more important for companies to have the right platform or the right infrastructure in place to make sense of all of this data.”

Along with starting to store information in the Cloud, the second thing which became prevalent is that we started having multiple different services to store this information. It was no longer one company which had all of this information. Today, your average start-up has got 10 different sources of data. This could be your backend databases, marketing data from Facebook Ads and Google Ads, data from your CRM, financial data from Xero or QuickBooks, or even Google Sheets mixed with application data from Greenhouse and Lever. The number of different data sources has increased, so this has resulted in a need to centralise data again. We need to bring it all into one place and make sense of the data.

That’s a quick history of how we got to where we are today, and why it’s becoming more and more important for companies to have the right platform or the right infrastructure in place to make sense of all of this data.

What can companies do to tackle this problem of disparate and convoluted different data sources?

I think there are four core steps when we think about data platforms today… Step one is how do we pull this data from these different data sources into a single place to analyse? Once you have this, you want to store all of this inside a data warehouse, which is structured to store large amounts of data. Modern day warehouses are able to separate storage from compute, which makes them really cost-efficient in being able to store lots and lots of data without racking up large bills. That’s step two.

For step three, you have all of this raw data; it’s messy and it’s not been structured in a way to answer business questions. You want data modelling to create a clean business layer, which is optimised to answer business questions. We call that the data modelling layer.

Step four is where we want to surface this information back to users inside our products, and it allows anyone to answer any questions by slicing and dicing data. You would have a simple BI or reporting layer, which would pull this highly structured data which we’ve just created. In general, that’s the core framework of the different layers of infrastructure. As time progresses, we are introducing more and more categories such as reverse ETL, and then there’s observability, augmented analytics, machine learning and AutoML, which are all additional categories. But the four core layers are the ones I just described.
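The four layers described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular vendor or 5X implements them: the in-memory SQLite database stands in for a cloud data warehouse, and the sample records, table names and function names (CRM_ROWS, BILLING_ROWS, raw_crm, raw_billing, customer_revenue, ingest, build_model, revenue_by_segment, run_pipeline) are all hypothetical.

```python
import sqlite3

# Hypothetical records standing in for two separate data sources
# (for example a CRM and a billing tool).
CRM_ROWS = [
    {"customer_id": 1, "name": "Acme", "segment": "B2B"},
    {"customer_id": 2, "name": "Globex", "segment": "B2B"},
    {"customer_id": 3, "name": "Initech", "segment": "SMB"},
]
BILLING_ROWS = [
    {"invoice_id": 10, "customer_id": 1, "amount": 1200.0},
    {"invoice_id": 11, "customer_id": 1, "amount": 300.0},
    {"invoice_id": 12, "customer_id": 2, "amount": 500.0},
    {"invoice_id": 13, "customer_id": 3, "amount": 250.0},
]


def ingest(conn):
    """Steps one and two: pull data from each source and land it, raw, in the warehouse."""
    conn.execute("CREATE TABLE raw_crm (customer_id INTEGER, name TEXT, segment TEXT)")
    conn.execute(
        "CREATE TABLE raw_billing (invoice_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_crm VALUES (:customer_id, :name, :segment)", CRM_ROWS
    )
    conn.executemany(
        "INSERT INTO raw_billing VALUES (:invoice_id, :customer_id, :amount)",
        BILLING_ROWS,
    )


def build_model(conn):
    """Step three: a clean business layer (one row per customer) built on the raw tables."""
    conn.execute(
        """
        CREATE VIEW customer_revenue AS
        SELECT c.customer_id,
               c.name,
               c.segment,
               COALESCE(SUM(b.amount), 0) AS total_revenue
        FROM raw_crm AS c
        LEFT JOIN raw_billing AS b ON b.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name, c.segment
        """
    )


def revenue_by_segment(conn):
    """Step four: a reporting query that reads only from the modelled layer."""
    rows = conn.execute(
        "SELECT segment, SUM(total_revenue) FROM customer_revenue "
        "GROUP BY segment ORDER BY segment"
    ).fetchall()
    return dict(rows)


def run_pipeline():
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    ingest(conn)
    build_model(conn)
    return revenue_by_segment(conn)


if __name__ == "__main__":
    print(run_pipeline())
```

The point of the sketch is the separation of concerns: the reporting query never touches the raw tables, so the modelling layer can change how revenue is calculated without breaking the reports built on top of it.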

Do you think that in the next few years companies will be able to manage their data more efficiently?

I think what’s happened has been really interesting. If you look at the data space, it’s one of the most fragmented spaces out there. Each of those four layers has multiple billion-dollar companies competing in it. Then, with the few additional categories I named afterwards, there are about 10 different categories. What this really means for end consumers is actually pretty grim, because the space is becoming mainstream and every company needs to get value from data.

For example, imagine walking into Honda, and instead of selling you a Honda Civic, they sell you an engine and you have to build your own car. That’s really what the buying journey looks like today for the end consumers.

Although we’ve made progress and have flexibility in these tools, it hasn’t been easy for companies who don’t have large armies of data teams to actually go and get value from this. The short answer to ‘have things got better’ is no. What I’m very bullish about in 2023 is a new category in the space called the managed data platform. This ensures that you can focus on the application of data instead of having to worry about setting up the infrastructure. In full transparency, I run a company which is focused on the managed data platform, but I’m trying to be as unbiased as possible.

Can you describe how a managed data platform works and what’s involved?

The goal is not just how we give businesses an end-to-end platform across these initial four categories, but also all of the other categories as the company grows in scale. If you look at software engineering, it’s a lot more mature.

How does software engineering solve this problem? Well, software engineering has Amazon Web Services (AWS), and if you really think about it, it’s just an umbrella for 50 different services. Amazon owns a lot of these 50 services, but it’s also got a marketplace where you have external services, and the Amazon platform gives you a central place to do certain things like provisioning or setting up templates.

It makes it very easy by giving you a macro platform, which grows with you from when you build your first product until you are an enterprise, a large customer, and through the entire journey in between. We partner with all of the different data vendors out there, so all of the different warehouses, and ingestion, and modelling, and all of these different categories are inside our ecosystem. We make it very easy to go build your first platform. Initially, you could pick a template based on your industry, use case or size. There will be a B2B template, a low-code template for companies who have fewer SQL capabilities, and a template for a Web3 company which needs to pull on-chain data.

We help to build your platform and then manage your users, and give you all the tools to upgrade your platform as you become a bigger company and have more advanced use cases, and everything in between, just like Amazon Web Services. In short, we’re trying to be the Amazon Web Services of the data platform world.

Is this simply a case of you go to your managed platform and you select what you want?

Yes, I think that’s where the magic comes in. We integrate with all of these different vendors at an API level. We handle provisioning, billing, user management, and team configuration on behalf of these vendors, and make it very easy by removing all of that complexity and giving you a single platform where you can manage multiple vendors at the same time.

On average, it takes companies four months to build a data platform. Today, they have to sign multiple different enterprise contracts, and this involves work by finance, legal, product, and billing. Building a platform on 5X today takes about four minutes. I’m not just making that number up, we’ve actually measured it. What we’re talking about is an end-to-end customer experience which is more streamlined and efficient than what exists inside the market today.

How do you think the whole data infrastructure and platform space plays out over the next five years, given where we are?

I think at a fundamental level, abstraction always goes upstream. We’ve seen time and time again that jobs get replaced by more automation. People always think that this is the end of jobs, but inevitably it creates a new category which employs more people, and things always go more upstream. For example, we don’t design chips anymore, we don’t write in C anymore, and we don’t optimise how our database is run.

All of this happens automatically. Database administrators were replaced by data platform engineers, who are getting replaced by data engineers, who at some point will get replaced by Data Scientists, and so on. When I think about infrastructure, I think we’re at a point where it’s no longer relevant to hire data platform engineers to build your data platform. New categories in this space promise to give you all of these different things, which allows you to focus more of your time on your data modelling, on building your BI, on data science, and on insights and recommendations. That means less time worrying about infrastructure and platform, which really wasn’t adding business value in the first place. It was just one of the building blocks.

As I see the space evolving, we’re moving away from a lot of the data infrastructure to more of the applications and data, and I think that’s a really exciting part of the journey.

What do you think that means for someone who’s a data engineer now in 5 or 10 years? What do you think they’ll be doing?

I think if you look at what data engineers were doing 5 or 10 years ago, 80% of their time was spent on building pipelines and moving data from one place to another. Ingestion tools such as Fivetran, Stitch, and Airbyte, and all of the different companies in this category, came and replaced that. Today, only a very small amount of a data engineer’s time should be spent thinking about pipelines because this should be fully automated. Instead, data engineering is evolving more into the data modelling side, where data engineers can spend most of their time.

This clean business layer, which the data engineers really build, is ultimately what powers the data products. It gives the Data Scientists the core models they need in order to go and build the insights and recommendations. It also powers data analysts to go deeper. Data engineering jobs have just moved higher up the abstraction level and they’re more important than they’ve ever been before.

What about Data Scientists? How do you see their role evolving?

Their role is getting more real. A lot of Data Scientists aren’t actually doing any data science, they’re focusing on all of the layers before. I think that if I look at ML, and data science, and MLOps, it’s finally this point where it’s less buzz-wordy and it’s becoming more real. The opportunity to go and join a company and actually do data science work in the next few years starts to become more tangible, so these people will be able to drive outsized returns in terms of the insights and recommendations they can provide for companies.

On the rise of data platforms and managed data platform providers – does that have any implications for how you structure a data team going forward?

I think a macro trend which we’re seeing in 2023 and moving forward is doing more with less. I think the downturn, or recession, has had a larger than normal impact on data teams. Globally, data teams have been quite affected, and I think in some ways it’s a correction. Some companies over-hired for their data teams on the big promise of everyone wanting to become data driven, and this is just part of a normal cost correction.

I think Data Engineers and Data Scientists will need to be relevant across a few core areas. Having a data analyst, a data engineer, a scientist, a data platform engineer, and then someone on MLOps inside every single team was a little bit too much. I think these skills will still exist individually, as I think specialisation is very necessary, but I think your average Data Engineer will be able to do more things on data platforms, more modelling and some level of data science. And vice versa, where your average Data Scientist will know how to build a stack from scratch.

I think in general, we are going to get a little bit more rounded and do a little bit more with less.

“As I see the space evolving, we’re moving away from a lot of the data infrastructure to more of the applications and data, and I think that’s a really exciting part of the journey.”

Will this differ depending on the size of a company?

I think the general trend which I’m seeing in large organisations that didn’t start off as tech companies, is that they’re the ones struggling right now. By putting more and more people into this problem and creating more and more silos, things just aren’t getting better for them. I think there’s going to be a lot of consolidation in terms of end-to-end platforms going in there, and this could be things like Palantir or Databricks or other platforms. It’s making their jobs much easier because I think at the end of the day, these very large organisations are the ones I see suffering the most in this current landscape. I think there’s a lot of opportunity to rebalance and change.