
LLMs, ELMs AND THE SEMANTIC LAYER BY TARUSH AGGARWAL

Tarush Aggarwal is one of the world’s leading experts in helping organisations to leverage data for exponential growth. He earned a degree in Computer Engineering from Carnegie Mellon University, and he became the first Data Engineer on the analytics team at Salesforce. Tarush also led the data function for WeWork before founding the 5x Company in 2020, which supports entrepreneurs in scaling their businesses. We asked Tarush to share his insights on how LLMs will impact the way businesses use their data. Tarush explains the key role ELMs and LLMs will play when positioned on top of the semantic layer. This development, Tarush argues, could lead to a more efficient interface that renders conventional BI tools obsolete.

Can you give us an overview of what you see companies currently doing with Generative AI?

Unless you've been living under a rock, you know generative AI has been massively transformational, but when we think about data and generative AI, there haven't been a lot of obvious use cases. In terms of figuring out how AI will enter the data space, companies are working on a few different things. AI does extremely well in certain closed domains, for example, writing code with Copilot or generating marketing content. Then there's text-to-SQL, where a bunch of different players are pursuing the notion of being able to ask any question of your data by using AI to generate the SQL needed to answer it. Personally, I'm not very bullish on that. A business question is extremely open-ended, and I don't think AI has proven to be very effective there yet.

But what I am very bullish about is plugging generative AI, a smaller expert language model (ELM), or a private LLM on top of your semantic layer. You push it on top of your metric definitions, and then you can ask very, very open-ended questions such as: "What was revenue like last month?", "Where did it come from?", "Is this channel going up?", "Is this channel going down?", "Is this increase in revenue actually statistically significant, or is it just a one-time blip?"

I think using AI on top of the semantic layer is the most exciting application of AI in the data space today.

Can you define what you mean by the semantic layer?

The semantic layer has been around forever. At a very high level, you have raw data coming into your data warehouse or storage layer, and this data is messy and unstructured. It's typically not built in a way that answers business questions. To model it, we clean it, join it, and reframe it in a way which makes sense for the business. The semantic layer is essentially just a definition layer: it allows us to interpret what that data is. It contains your definitions, for example, your definition of revenue, of what you call an active user, or of revenue from a particular channel. It lets us compute any given business metric in a way which is completely consistent.

If anyone wants to compute a metric, they refer to the semantic layer, which gives them the definition of that metric from a SQL or coding perspective. All the consumers who want that metric, instead of reading it directly from the raw data, read it from the semantic layer. This means that everyone has the same definition of that metric, and if you ever want to change it, you change it in one place and it propagates to all of your different consumers.
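To make this concrete, here is a minimal sketch of a semantic layer as a metric registry. Everything in it is a hypothetical assumption for illustration: the metric names, the orders and sessions tables, and the get_metric_sql helper are not any particular vendor's API.

```python
# A minimal sketch of a semantic layer as a metric registry.
# All names (metrics, tables, columns) are hypothetical examples.

METRICS = {
    "monthly_revenue": {
        "description": "Sum of completed order totals per calendar month",
        "sql": """
            SELECT DATE_TRUNC('month', ordered_at) AS month,
                   SUM(order_total) AS monthly_revenue
            FROM orders
            WHERE status = 'completed'
            GROUP BY 1
        """,
    },
    "active_users": {
        "description": "Distinct users with a session in the last 30 days",
        "sql": """
            SELECT COUNT(DISTINCT user_id) AS active_users
            FROM sessions
            WHERE session_start >= CURRENT_DATE - INTERVAL '30 days'
        """,
    },
}

def get_metric_sql(name: str) -> str:
    """Every consumer (BI tool, notebook, LLM) resolves a metric here,
    so the business definition lives in exactly one place."""
    return METRICS[name]["sql"]
```

The design point is the single lookup path: changing the definition of revenue in the registry changes it for every consumer at once, which is exactly the consistency property described above.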

One of the questions I'm thinking about is: are we approaching the end of BI tools, and are they going to be replaced by an ELM or an LLM on top of your semantic layer, because that's a better interface? We're starting to see a few players doing this, and new startups like Delphi Labs are building products specifically on top of a semantic layer.

The idea is that the end user directly leverages an LLM to get around having to engage with a Data Science department. They can get reports, visualisations or insights from the data on demand when they need it through the LLM connecting directly to the semantic layer and the source data.

The reason you build a BI dashboard is to answer questions, and that means you need to know in advance what those questions are. However, gaining access to information leads to asking better questions. For instance, I get some information, I think about it, and because of that I now have a new question. And when I have a new question, I don't want to go to a data team and say, "Great, I have a new question now." The issue with putting your LLMs or your ELMs directly on top of your raw data is that you aren't sure whether the ELM or the LLM has just made up a business definition. It could simply be hallucinating, which is why I like having it on the semantic layer: you can be assured that whenever it computes a metric, it's computed in the right way.
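As a rough illustration of that point, here is a sketch of how an LLM can be constrained to select from governed metric definitions rather than inventing its own. The ask_llm function is a placeholder for whatever private LLM or ELM you deploy, and pick_metric is a hypothetical helper; the key property is that a hallucinated metric name gets rejected against the registry.

```python
# Sketch: the LLM may only *select* a governed metric; it never writes
# its own business definition. METRICS is the registry sketched above.
from typing import Optional

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in your private LLM or open-source ELM here.
    raise NotImplementedError

def pick_metric(question: str, metrics: dict) -> Optional[str]:
    catalogue = "\n".join(
        f"- {name}: {spec['description']}" for name, spec in metrics.items()
    )
    prompt = (
        "Pick the single best metric for the question below. "
        "Answer with the metric name only, or NONE if nothing fits.\n"
        f"Available metrics:\n{catalogue}\n\nQuestion: {question}"
    )
    choice = ask_llm(prompt).strip()
    # Validate against the registry: a made-up name is rejected, so the
    # model cannot silently hallucinate its own definition of revenue.
    return choice if choice in metrics else None
```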

In what sense do you see LLMs stepping into this?

Many years ago, the semantic layer was really part of business intelligence. It was part of what we call 'BI tools' or 'reporting tools', including the likes of Tableau, and more recently Looker, plus newer options like Preset, Lightdash and Sigma. Your metrics used to exist inside your BI tool.

In the last few years, we realised that if those metrics exist outside the BI tool, as a standalone layer, then anything can consume them, including an LLM. And with an LLM sitting on top of the semantic layer, I can now be very specific. I'm not just asking what the revenue was last month, or what the revenue was by channel; if I see some channels going up and down, I can ask whether that movement is significant. Has this been happening periodically? What percentage of that revenue came from existing customers versus new customers? What percentage came from existing customers in this new channel, compared to how much money we spend on that channel? Is this channel worth it, or am I better off using another channel? These are highly contextual questions which I'm combining on the fly, and that is just not possible in the old paradigm of building reports inside BI tools.

Are there any specific changes you see happening to semantic layers to get them fit for LLMs?

No. I think at a high level, it's probably best to pull the semantic layer out of BI and have it as a standalone layer. It's not complicated to plug an ELM or an LLM in on top, and here two things come to mind.

Firstly, we are very early in the adoption lifecycle of the semantic layer. dbt Labs launched their semantic layer last year and it was a big failure. They then acquired Transform, and in some ways they're deprecating their original semantic layer and replacing it with Transform's. So one of the biggest companies behind data modelling got it wrong, and because of that we've seen companies like Cube and others launch their own semantic layers.

But – if I am able to deploy this layer and I do it right, I can truly build an organisation whereby anyone in the company can ask any question, which is really exciting.

Secondly, we are not talking about using a public version of an LLM, such as taking ChatGPT and connecting it to your semantic layer, which is probably not a good idea. There is now a version of ChatGPT which can be deployed privately for your company, or you can use an open-source ELM. The main difference between ELMs and LLMs is that ELMs are built on top of your own data sets: they're open source and more focused on your business, whereas a large language model contains far more general information. I think either can work, but you should use a private version of an LLM or your own open-source ELM.

I think what BI tools have done really well for many years is allowing you to control access to the data. An example of this would be the sales manager of a store only seeing data for their store, while their manager can see data for a number of stores, and their manager in turn can see it for the city and for the state. We can filter all of these things inside a BI tool by having different layers of access control.

So, how do we do this with semantic layers?

There are a few different ways. Even though semantic layers are pretty new, they are starting to build out role-based access controls governing who can access what information. Another question is whether you want this in the semantic layer for enterprise-wide adoption, or inside your AI layer. These are open questions, and I don't think any of them are really hard problems to solve. But because we are at the infancy stage, best-practice examples of how to do this haven't yet been clearly defined.
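As one possible shape for this, here is a minimal sketch of row-level access control enforced at the semantic layer, using the store-manager example above. The User type, the roles, and the store_id column are all illustrative assumptions, not a feature of any particular semantic layer product.

```python
# Sketch: row-level access control applied at the semantic layer, so the
# same scoping holds no matter which consumer (BI, notebook, LLM) asks.
from dataclasses import dataclass

@dataclass
class User:
    name: str
    role: str            # e.g. "store_manager", "regional_manager", "exec"
    store_ids: list      # stores this user is allowed to see

def scoped_sql(base_sql: str, user: User) -> str:
    """Wrap a governed metric query with the caller's row-level filter."""
    if user.role == "exec":
        return base_sql  # unrestricted view
    ids = ", ".join(str(i) for i in user.store_ids)
    return f"SELECT * FROM ({base_sql}) q WHERE q.store_id IN ({ids})"

manager = User(name="Ana", role="store_manager", store_ids=[42])
print(scoped_sql(
    "SELECT store_id, SUM(order_total) AS revenue FROM orders GROUP BY 1",
    manager,
))
```

Whether this filter lives in the semantic layer or in the AI layer on top is exactly the open question raised above; placing it in the semantic layer means every consumer inherits it for free.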

What is required to connect an LLM to the semantic layer to make sure it can navigate the data in a robust way?

I think the first version of this is going to be exporting data to a Google Sheet, loading it into your own private LLM, and asking questions on top of that. The issue, again, is that you don't have a semantic layer underneath, so there's always some fear that it's hallucinating a definition of your metrics. I wouldn't recommend going down that route; you want it plugged in on top of your semantic layer.

Now, you may have the expertise to use these APIs to connect this on top of your semantic layer yourself. As I mentioned, there are some companies doing this, Delphi Labs being an example. You can sign up for their service online, connect Slack on top of your semantic layer, and start asking questions. We've seen this pattern in content marketing and in writing code with Copilot. There are more and more of these vertical applications being built, and I think the next tier will be applications like Delphi, which connect to your semantic layer through a very easy, intuitive interface, whether that's Slack or text-to-speech.
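Putting the earlier sketches together, the core loop behind such a Slack-style interface might look roughly like this. This is not Delphi's actual architecture: run_query is a placeholder for your warehouse client, and pick_metric and ask_llm come from the hypothetical sketches above.

```python
# Sketch of an end-to-end "ask a question" loop:
# question -> governed metric -> SQL -> result -> narrated answer.
# Reuses METRICS, pick_metric and ask_llm from the earlier sketches.

def run_query(sql: str) -> list:
    # Placeholder: execute against your warehouse (Snowflake, BigQuery, ...).
    raise NotImplementedError

def answer_question(question: str, metrics: dict) -> str:
    name = pick_metric(question, metrics)
    if name is None:
        return "I don't have a governed metric for that question yet."
    rows = run_query(metrics[name]["sql"])  # definition comes from the registry
    summary_prompt = (
        f"Question: {question}\nMetric: {name}\nRows: {rows}\n"
        "Answer the question in one or two sentences."
    )
    return ask_llm(summary_prompt)
```

Note that the LLM appears only at the two ends, choosing a metric and narrating the result; the numbers themselves always come from the governed SQL, which is what keeps the answer from being a hallucination.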

Do you see a clear path for the connection between the LLM and semantic layer in the complex enterprise setting?

I think it's a really interesting question. The semantic layer implies that you're talking about a very focused set of questions which make sense for your business. The revenue of my business has nothing to do with the many other things an LLM knows. Other typical applications, such as generating content, require a lot of training of the LLM. Answering questions which are very specific to my business is a much easier domain for generative AI, which is why I'm also bullish about deploying your own open-source ELM. If you have the technology to connect it on top of the semantic layer, it's still powerful without the hundreds of thousands of hours required to train LLMs.

Can you talk a little bit about the advantages and disadvantages of ELMs versus LLMs?

Recently, there was a famous memo leaked from Google stating that they don't think they, or OpenAI, have any long-term, sustainable competitive advantage over the open-source LLM world. I think this goes back to that bigger question: LLMs that require a massive amount of data to train, versus a smaller ELM trained on top of your own company's data sets. The question is how close these ELMs can get to an LLM, especially for highly contextual use cases, and how much work is involved in taking these open-source models through the Last Mile.

It's easy to deploy the model, but the Last Mile means training it, giving it access to your data, and actually being able to get value from it. On the infrastructure side, there will be plenty of companies thinking about the Last Mile. It's very difficult for a business today to go and build an AWS, or to go and build a GPT-3, because you need a lot of application development. But if it's not very difficult for a business to deploy their own ELM which they fully control, compared to even a private version of ChatGPT, then you can deploy this very cheaply. We'll find out which way this goes in the next six to 12 months, but I can see a world where the big companies do not have as big a lead as they thought, and ELMs become just as powerful.

I think that 99% of companies will be consumers of AI, and 1% of companies will actually create AI. Everyone is now AI-driven, but only a small percentage of companies will actually go and build the infrastructure for it; the majority will just use it. If you build it, obviously you have a much greater competitive advantage. But having said that, around five years ago every single e-commerce company wanted to build their own recommendation engine, and the reality is that 99% of those companies should not have been building their own recommendation engine; they should have been using pre-existing tools.


I see that same analogy playing out with ELMs versus LLMs. We're still very early, right?

We haven’t yet defined who the category winners are in many of these areas. If you are able to do this yourself, you have more flexibility on what this looks like, and you can customise it in a way which makes sense for your business.

It’s not very clear that an LLM is a hundred times better than an ELM. It’s just too early to actually see what happens.

How does a company get started with deploying the semantic layer and what advice have you got for them?

I think Cube is our favourite semantic layer right now. It's modelled on LookML, the semantic layer for Looker, and it's an open-source project, as is Transform, which dbt Labs acquired. Both are really good if you want a self-deployed version. I don't think you can use a managed version of Transform at the moment, although it should be available later this year. And if you need more expertise, at 5X we can assemble the entire data layer for you, so that's something we can help out with as well.

 
