RAPID AND BUDGET-FRIENDLY LLM TRAINING USING LoRA: FRANCESCO GADALETA

Francesco Gadaleta PhD is a seasoned professional in the field of technology, AI and data science. He’s the founder of Amethix Technologies, a firm specialising in advanced data and robotics solutions. He hosts the popular Data Science at Home podcast, and over his illustrious career he’s held key roles in the healthcare, energy, and finance domains. Francesco’s professional interests are diverse, spanning applied mathematics, advanced machine learning, computer programming, robotics, and the study of decentralised and distributed systems.

In this post, Francesco explains how the introduction of LoRA levelled the playing field, making complex LLMs accessible to the mainstream. LLMs used to be expensive and time-consuming to run, but with LoRA, you can train an LLM at a fraction of the time and cost. Francesco deep-dives into LoRA’s origins and methodology, to reveal what makes it so effective.

There are methodologies that actually allow large language models to run and be retrained quickly – and there is a lot of activity and research in this area. There’s one method in particular, LoRA, that reveals the secret behind how large corporations like OpenAI can provide large language models to millions of people, and can retrain these models over and over again.

How can these models stay up to date with whatever is going on in the world, and how is it possible that a 175 billion parameter model can be constantly retrained and fine-tuned with minimal effort? In fact, these models are not fully retrained at all, and that is where the secret sauce is. The problem with large language models is the very fact that they are large. 175 billion parameters is a number that we could not have conceived of until a few months ago. Most of us were running models at home, on our work infrastructure, or on AWS with several million parameters, not billions. With ChatGPT, we broke that ceiling, which is a good thing, because there is now a trend towards making models smaller.

We know that if we increase the number of parameters, the model becomes more powerful. That means we have more data, more parameters, more degrees of freedom, and potentially more sophistication in the answers that large language models (in the case of NLP) can give. But the bigger the model, the less scalable the model is. From a practical perspective, this might be a way to keep other companies out of the competition. Only the big players who have the financial and infrastructure capacity can actually do research that brings results to the world of AI.

Even more so with building massive models. A few years ago, practitioners and researchers made an unwritten promise to democratise data science. This was clearly not happening, at least until LoRA came out. Fortunately, mathematics and computer science come to the rescue. LoRA is one of the most important methodologies behind the powerhouse of such big models, and is the reason why these models can be fine-tuned constantly with minimal effort. By minimal effort, I mean several orders of magnitude fewer parameters to be fine-tuned.

It’s worth stating that there’s no magic here, there’s no AGI, there’s no religion behind deep learning and artificial intelligence. There is mathematics and then there is linear algebra, optimisation, and computer science. With this said, let’s get started. LoRA comes from a two-year-old paper written by researchers at Microsoft. The title of the paper is ‘LoRA: Low-Rank Adaptation of Large Language Models’. Of course, it is built on several other concepts that predate 2021.

For example, the concept of transformers is even older than that: it originates from Google, back in 2017. Seeing Google lag behind in the ChatGPT/LLM race when they actually invented the transformer, one of the most used architectures in large language models today, is rather bizarre. For some, this is actually expected when research is made publicly available and paves the way for anyone to create new tools, methods and models. This is the beauty of research and healthy competition, which both keep raising the bar and make people more ambitious.

Remember the size of models from the Computer Vision field of research?

The typical classifier or object recogniser could sport something like a few dozen layers and several million parameters, in order to convert pixels into more and more abstract representations until assigning a label (in the case of an object classifier). Back in the day, we already considered such models ‘large’. One concept that was already heavily applied, in order to classify animals or tumors, cars or people with the same model, was transfer learning. Transfer learning has been the trick to move from one domain to another (from analysing general purpose images to, for example, medical images) for many years.

It consists of keeping some of the initial layers of a network frozen (that is, not retrained), and retraining the remaining layers up to the output. It seemed to work really well, due to the fact that all images share the same low-level pixel representation, regardless of their type or domain. However, one issue with such a technique is that applying transfer learning to, say, 10 domains would require 10 different models that, except for the first layers, would be completely different. Even if the disk space to store large numbers of parameters is no longer a problem for many, the size that such models require in memory (RAM or GPU) suddenly becomes prohibitive.
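The freeze-and-retrain idea described above can be sketched in a few lines of NumPy. This is only a toy illustration, not code from any of the models mentioned: the layer sizes, the random data, and the names `W_frozen` and `W_head` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" first layer (frozen) and a new task head (trainable).
W_frozen = rng.normal(size=(8, 16)) / 4.0   # maps 16 input features to 8 hidden features
W_head = rng.normal(size=(1, 8)) * 0.1      # maps hidden features to one output

x = rng.normal(size=(16, 32))               # 32 toy examples, one per column
y = rng.normal(size=(1, 32))                # toy regression targets

def forward(x):
    h = np.maximum(W_frozen @ x, 0.0)       # frozen feature extractor with ReLU
    return h, W_head @ h

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

W_frozen_before = W_frozen.copy()
loss_before = mse(forward(x)[1], y)

# Gradient descent on the head only; W_frozen never receives an update.
lr = 0.05
for _ in range(100):
    h, pred = forward(x)
    grad_head = 2.0 * (pred - y) @ h.T / y.size
    W_head -= lr * grad_head

loss_after = mse(forward(x)[1], y)
print(loss_before, loss_after)
```

Every new domain would need its own copy of the retrained layers, which is the storage and memory issue raised above.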

Back in the day, there was no way to retrain just a tiny fraction of the weights of the initial model and proceed with transferring the model to other domains. LoRA removes many of these limitations. Any large model can essentially be extended with just a tiny number of weights, without limiting, for example, the length of the input sequence (context or prompt). At the same time, fine-tuning such models would not incur any loss of accuracy or increase in inference time. In general, extending a model with additional parameters means slower predictions at inference time, due to the larger number of matrix calculations. So, how can one get the best of both worlds, namely not being forced to retrain a model from scratch and, without a full retrain, not paying a prohibitive cost in terms of accuracy? There can be only one answer: mathematics.

Low Rank Explained

Before explaining how the LoRA methodology works, I need to explain what low-rank means. As the name suggests, low-rank adaptation enforces and exploits the concept of a low-rank matrix. A low-rank matrix is a matrix where the number of linearly independent rows or columns is much smaller than the total number of rows or columns. A matrix with a lot of linearly independent rows and columns is more difficult to factorise. In a matrix with a number of independent rows and columns much smaller than the size of the matrix itself, there is a lot of room for factorisation in very efficient ways. The curious reader who wants to know more about low-rank matrices and linear algebra can find many sources, from Wikipedia to textbooks. This is such an amazing field, at the core of pretty much any machine learning operation. Without familiarity with linear algebra, one usually misses the nitty-gritty of machine learning algorithms, from logistic regression, to deep learning, to ChatGPT, and anything that can be built on top.
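To make the definition concrete, here is a minimal NumPy sketch that builds a low-rank matrix from two thin factors and compares how many numbers each representation needs. The size `d = 512` and rank `r = 4` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 512, 4                       # matrix size and (much smaller) rank
B = rng.normal(size=(d, r))         # tall, thin factor
A = rng.normal(size=(r, d))         # short, wide factor
M = B @ A                           # a d x d matrix with rank at most r

rank = np.linalg.matrix_rank(M)     # number of linearly independent rows/columns
dense_params = d * d                # numbers needed to store M directly
factored_params = d * r + r * d     # numbers needed to store B and A instead

print(rank, dense_params, factored_params)
```

Storing the two factors takes 64 times fewer numbers than storing the full matrix in this toy case, and the gap widens as the matrix grows while the rank stays small.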

As a matter of fact, LoRA can be applied to any dense layer of a model, although the authors initially focused on certain weights in transformer language models only. In particular, they ran experiments and performance benchmarks on GPT-3 with 175 billion parameters, the most advanced model at the time. As a generic mathematical method, it applies to other models too. Running a neural network in training or inference mode basically means performing matrix multiplications in the background. During inference, the input is (generally) transformed into a matrix, and that matrix is multiplied with other matrices that represent the inner layers of the network. There are usually hundreds, even thousands of such layers. The form of the output depends on the specific task (also called the downstream task): it could be a probability, a vector of probabilities, a label, an index, etc.
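As a rough illustration of inference as a chain of matrix multiplications, the sketch below pushes one input vector through three small dense layers and turns the result into class probabilities and a label. The layer sizes, the ReLU non-linearity and the softmax output are assumptions made for the example, not details of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Three hypothetical layers: 10 inputs -> 32 -> 32 -> 5 output classes.
layers = [rng.normal(size=(32, 10)) * 0.1,
          rng.normal(size=(32, 32)) * 0.1,
          rng.normal(size=(5, 32)) * 0.1]

x = rng.normal(size=10)             # the input, already encoded as a vector
h = x
for W in layers[:-1]:
    h = np.maximum(W @ h, 0.0)      # matrix multiplication followed by ReLU
probs = softmax(layers[-1] @ h)     # final layer produces class probabilities
label = int(np.argmax(probs))       # or reduce the probabilities to one label

print(probs, label)
```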

A pre-trained weight matrix (usually called W0) has a certain dimension and rank. The key observation from the authors of LoRA is that pre-trained language models usually have a low intrinsic dimension. This means that it is possible to learn efficiently even after a random projection of such a matrix to a smaller subspace. In other words, even by projecting to a much smaller dimensional space, these models don’t lose accuracy. Such an intrinsic dimension is usually much lower than the initial dimension. A hidden layer of a neural network is usually represented by a matrix W, also called the weight matrix. During training, such a weight matrix receives constant gradient updates via backpropagation and changes accordingly.

Typically, one retrains a network by computing gradients, backpropagating them, and updating all the weights, over and over again. Researchers found that one can actually constrain the update of a weight matrix by representing it with a low-rank decomposition (B.A, or B dot A, where the pre-trained matrix W0 and the product B.A are multiplied with the same input vector/matrix and the results are summed). Representing a weight matrix W as the initial pre-trained matrix W0 plus B.A, one can retrain only a subset of all the parameters. This is usually a very tiny number. Therefore, the trainable parameters are reduced to those of B and A rather than W0, which stays untouched. To make such a concept even clearer, one essentially trains two matrices that are way smaller than the initial weight matrix of the same model. Transferring the model to other domains, or fine-tuning it, requires retraining just the matrices B and A.
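The decomposition W = W0 + B.A can be sketched as a single layer in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sizes, learning rate and squared-error loss are made up, A starts from small random Gaussian values and B from zeros (so the update B.A is zero before training), and the scaling factor used in practical LoRA implementations is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 4

W0 = rng.normal(size=(d_out, d_in)) / 8.0   # pre-trained weights, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01       # trainable, small random Gaussian init
B = np.zeros((d_out, r))                    # trainable, zero init so B @ A starts at 0

def lora_forward(x, W0, A, B):
    # W0 and the low-rank update B @ A are applied to the same input and summed.
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=(d_in, 8))              # batch of 8 toy inputs, one per column
target = rng.normal(size=(d_out, 8))        # toy targets for the new task

def loss(A, B):
    return 0.5 * float(np.sum((lora_forward(x, W0, A, B) - target) ** 2))

W0_before = W0.copy()
initial_output = lora_forward(x, W0, A, B)
loss_before = loss(A, B)

lr = 0.001
for _ in range(20):
    err = lora_forward(x, W0, A, B) - target
    grad_B = err @ (A @ x).T                # gradient of the loss w.r.t. B
    grad_A = B.T @ err @ x.T                # gradient of the loss w.r.t. A
    B -= lr * grad_B                        # only A and B are ever updated
    A -= lr * grad_A

loss_after = loss(A, B)
trainable = A.size + B.size                 # 2 * 64 * 4 numbers
frozen = W0.size                            # 64 * 64 numbers
print(trainable, frozen, loss_before, loss_after)
```

In this toy layer only 512 numbers are trained while the 4,096 pre-trained weights stay fixed; with realistic layer widths and a small rank the ratio becomes far more lopsided.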

As this is a very quick operation with very little memory overhead, retraining regularly would be absolutely possible. Moreover, as the initial weight matrix would stay basically untouched, large language models would not be affected by storage limitations, due to the fact that the initial weight matrix will be stored once and for all. One can, in principle, apply LoRA to any subset of weight matrices in a neural network in order to reduce the number of trainable parameters. But, in their paper, the authors apply the method only to the transformer architecture.
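A small sketch of the "stored once and for all" point: several tasks can share one copy of the pre-trained matrix and keep only their own small pair of factors. The task names and sizes below are hypothetical. The sketch also folds B.A back into W0 before inference, which is one way to arrive at a single matrix multiplication per layer and so avoid extra inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4

W0 = rng.normal(size=(d, d))             # shared pre-trained weights, stored once

# Hypothetical adapters for two downstream tasks; each is only 2 * d * r numbers.
adapters = {
    "summarise": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "translate": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def merged_weights(W0, adapter):
    B, A = adapter
    return W0 + B @ A                    # fold the update into one d x d matrix

x = rng.normal(size=d)
B, A = adapters["summarise"]
unmerged = W0 @ x + B @ (A @ x)          # two paths, as during training
merged = merged_weights(W0, adapters["summarise"]) @ x   # one matmul at inference

full_copies = len(adapters) * W0.size    # one full weight matrix per task
shared = W0.size + sum(b.size + a.size for b, a in adapters.values())
print(full_copies, shared)
```

Switching tasks then amounts to swapping one small pair of matrices, while the large matrix W0 is never duplicated.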

In particular, the researchers only adapted the attention weights for downstream tasks, and they froze the multi-layer perceptron layers or modules, meaning these are not retrained in downstream tasks at all. After all, they just had to prove that the method works and that it can be generalised. To put things in perspective, fine-tuning GPT-3 with 175 billion parameters might require approximately one terabyte of GPU memory.

Reducing the memory footprint to roughly 350GB, the figure reported in the paper, would immediately cut costs considerably. By lowering the rank even further, possibly at some cost in accuracy on a given downstream task, one could train about 10,000 times fewer parameters than full fine-tuning requires.
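The order of magnitude of that reduction can be checked with back-of-the-envelope arithmetic. The numbers below are assumptions that do not come from this article: a hidden size of 12,288 and 96 layers for GPT-3 175B, LoRA applied to two attention matrices per layer with rank 4, and 2 bytes per parameter (16-bit floats).

```python
# Assumed GPT-3 175B dimensions (not stated in this article).
d_model = 12288
n_layers = 96
total_params = 175_000_000_000

# Assumption: LoRA on two attention matrices per layer, with rank 4.
rank = 4
adapted_per_layer = 2

def lora_trainable_params(d_model, n_layers, rank, adapted_per_layer):
    # Each adapted d_model x d_model matrix gets B (d_model x rank) and A (rank x d_model).
    return n_layers * adapted_per_layer * 2 * d_model * rank

trainable = lora_trainable_params(d_model, n_layers, rank, adapted_per_layer)
reduction = total_params / trainable

bytes_per_param = 2                                   # 16-bit floats, an assumption
full_weights_gb = total_params * bytes_per_param / 1e9
adapter_mb = trainable * bytes_per_param / 1e6

print(f"{trainable:,} trainable parameters, about {reduction:,.0f}x fewer")
print(f"{full_weights_gb:.0f} GB of full weights vs {adapter_mb:.1f} MB of adapter weights")
```

Under these assumptions the trainable parameters drop from 175 billion to roughly 19 million, which is close to the 10,000-fold reduction quoted above.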

Such an impressive reduction could benefit all smaller companies that need LLMs for their research or core business, while not having access to the financial resources of the bigger players. With LoRA one no longer needs hundreds of costly GPUs, or expensive cloud infrastructure. We all know how costly retraining machine learning models is, especially because they represent trial-and-error science. Last but not least, low-rank matrices are at the core of another essential improvement: researchers measured an increase of about 25% in training speed.

Figure: Number of trainable parameters and accuracy of GPT-3 trained with different methods

I strongly believe LoRA is one of the most important methods out there since the transformer architecture was introduced. Researchers are more and more interested in reducing the number of parameters of large models, while maintaining the same level of accuracy and power. Observing low-rank structures in deep learning is to be expected. But acknowledging their existence and measuring their practical benefits is a completely different story. The LoRA paper finally proves and shows that what was an opinion back in 2014 and 2018 is now a fact. Remember that every time you chat with ChatGPT.