Abdullah Şamil Güser

LoRA: Low-Rank Adaptation of Large Language Models

The paper “LoRA: Low-Rank Adaptation of Large Language Models” presents a novel approach to adapting large-scale pre-trained language models, like GPT-3, to specific tasks or domains while minimizing the computational and memory costs typically associated with such adaptations.

Introduction

Traditional adaptation of language models relies on fine-tuning, where all parameters of a pre-trained model are updated. This approach becomes increasingly impractical as models grow; GPT-3, for example, has 175 billion parameters. Moreover, existing parameter-efficient techniques often introduce inference latency by extending model depth, or reduce the model's usable sequence length. More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.

Baselines

Fine-Tuning: This is the conventional method, where all parameters of a pre-trained model are updated. It serves as the primary baseline for assessing LoRA's performance.

Bias-only or BitFit: A method that fine-tunes only the bias parameters of a pre-trained model.

Prefix Embedding Tuning: This approach optimizes a set of continuous vectors (prefixes) prepended to the input sequence, steering the model's behavior without modifying the pre-existing weights.

Prefix Layer Tuning: An extension of prefix embedding tuning that learns the prefix activations after every Transformer layer, rather than only at the input embeddings.

Adapter Tuning: This method inserts trainable layers (adapters) between existing layers of a pre-trained model, allowing task-specific adjustments without altering the original weights.

Aren’t Existing Solutions Good Enough?

Adapter Layers Introduce Inference Latency: Large neural networks rely on hardware parallelism to keep latency low, but adapter layers have to be processed sequentially. The problem gets worse when we need to shard the model, because the additional depth requires more synchronous GPU operations such as AllReduce and Broadcast, unless we store the adapter parameters redundantly many times.

Directly Optimizing the Prompt is Hard: Prefix tuning is difficult to optimize, and its performance changes non-monotonically with the number of trainable parameters. More fundamentally, reserving part of the sequence length for adaptation necessarily reduces the length available for processing the downstream task.

Low-Rank Adaptation (LoRA)

LoRA is inspired by the finding that over-parametrized models have a low intrinsic dimension. It adapts pre-trained models by optimizing low-rank decomposition matrices of the weight updates, thereby modifying dense layers indirectly while keeping the pre-trained weights frozen. The motivation is as follows:

When adapting to a specific task, pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a random projection to a smaller subspace.

Inspired by this, we hypothesize that the updates to the weights also have a low "intrinsic rank" during adaptation.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$.

During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain trainable parameters. Note that both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, our modified forward pass yields:

\[h = W_0 x + \Delta W x = W_0 x + BAx\]

We use a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training. We then scale $\Delta W x$ by $\alpha / r$, where $\alpha$ is a constant in $r$; when optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate.
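The following is a minimal PyTorch sketch of this forward pass, not the authors' official `loralib` implementation; the class name, the Gaussian scale chosen for $A$, and the default $r$ and $\alpha$ are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA layer: h = W0 x + (alpha / r) * B A x."""

    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        # Frozen pre-trained weight W0 (stands in for a loaded checkpoint).
        self.W0 = nn.Linear(d_in, d_out, bias=False)
        self.W0.weight.requires_grad = False
        # Trainable low-rank factors: A is Gaussian-initialized, B is zero,
        # so Delta W = B A is zero at the start of training.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A in R^{r x d_in}
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B in R^{d_out x r}
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return self.W0(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Only $A$ and $B$ receive gradients, so a `LoRALinear(768, 768, r=4)` trains roughly 6K parameters instead of the ~590K in the frozen $W_0$.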

A Generalization of Full Fine-tuning: As the rank $r$ increases, training LoRA roughly converges to training the original model, while adapter-based methods converge to an MLP and prefix-based methods to a model that cannot take long input sequences.

No Additional Inference Latency: When deployed in production, we can explicitly compute and store $W = W_0 + BA$ and perform inference as usual. Note that both $W_0$ and $BA$ are in $\mathbb{R}^{d \times k}$. When we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and then adding a different $B'A'$, a quick operation with very little memory overhead.
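A sketch of that merge-and-swap, reusing the `LoRALinear` class from the sketch above (the function names here are hypothetical, not from the paper):

```python
@torch.no_grad()
def merge(layer: LoRALinear) -> None:
    # W = W0 + (alpha / r) * B A: inference then runs as a plain linear layer.
    layer.W0.weight += layer.scaling * (layer.B @ layer.A)

@torch.no_grad()
def unmerge(layer: LoRALinear) -> None:
    # Recover W0 by subtracting B A; a different task's B'A' can then be merged in.
    layer.W0.weight -= layer.scaling * (layer.B @ layer.A)
```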

Practical Benefits and Limitations: It is not straightforward to batch inputs for different tasks, each with its own $A$ and $B$, in a single forward pass if one chooses to absorb $A$ and $B$ into $W$ to eliminate additional inference latency.

Understanding the Low-Rank Updates

Parameter Budget Efficiency: For a given parameter budget, the authors explore which subsets of weight matrices should be adapted using LoRA to maximize downstream performance. They find that adapting both the query ($W_q$) and value ($W_v$) weights in the self-attention module gives the best results, suggesting that even a low rank captures sufficient information for effective adaptation.
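As an illustration, one might wrap only the query and value projections with the `LoRALinear` sketch above; the attribute names (`blocks`, `attn`, `q_proj`, `v_proj`) are hypothetical and vary between model codebases.

```python
def add_lora_to_attention(model: nn.Module, r: int = 4, alpha: float = 8.0) -> None:
    for p in model.parameters():
        p.requires_grad = False                  # freeze all pre-trained weights
    for block in model.blocks:                   # hypothetical attribute names
        for name in ("q_proj", "v_proj"):        # adapt W_q and W_v only
            old = getattr(block.attn, name)
            lora = LoRALinear(old.in_features, old.out_features, r, alpha)
            lora.W0.weight.data.copy_(old.weight.data)  # carry over W0
            setattr(block.attn, name, lora)
```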

Optimal Rank Determination: The optimal rank for LoRA is probed, showing that even a rank as small as one can yield competitive performance, indicating that the adaptation matrices have a small "intrinsic rank". This leads to the observation that increasing the rank doesn't necessarily cover a more meaningful subspace, advocating for the sufficiency of low-rank adaptations.

Subspace Similarity: By measuring the normalized subspace similarity based on the Grassmann distance, the study finds a significant overlap in the top singular-vector directions between adaptation matrices learned with different ranks, supporting the notion of a low "intrinsic rank".
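The paper's measure is $\phi(A_{r_1}, A_{r_2}, i, j) = \lVert U_{A_{r_1}}^{i\top} U_{A_{r_2}}^{j} \rVert_F^2 / \min(i, j) \in [0, 1]$, where $U^i$ holds the top $i$ singular directions. Below is one plausible way to compute it, a sketch rather than the paper's reference code, assuming $i \le r_1$ and $j \le r_2$.

```python
import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
    # Top-i / top-j right-singular directions of the two adaptation matrices
    # (A has shape (r, k), so its row space lives in R^k).
    V1 = torch.linalg.svd(A1, full_matrices=False).Vh[:i]  # (i, k)
    V2 = torch.linalg.svd(A2, full_matrices=False).Vh[:j]  # (j, k)
    # phi = ||V1 V2^T||_F^2 / min(i, j), in [0, 1]; 1 means complete overlap.
    return (torch.linalg.norm(V1 @ V2.T) ** 2 / min(i, j)).item()
```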

Adaptation Matrix vs. Pre-trained Weights: Further investigation reveals that the adaptation matrix $\Delta W$ correlates with the pre-trained weights $W$, amplifying directions that are already present in $W$ but not emphasized, hence catering to specific downstream tasks.

Conclusion & Future Work

LoRA is presented as a cost-effective strategy for adapting large language models, addressing the high hardware and storage demands of fine-tuning. This method is efficient, allowing rapid switching between tasks without extra inference latency or reduction in input sequence length, and is suitable for service deployment due to its parameter-sharing capability. The approach, while applied to Transformer models, is broadly relevant to neural networks with dense layers.

Future directions include integrating LoRA with other adaptation methods for further improvements, investigating the transformation processes of features from pre-training to downstream tasks, and developing systematic approaches for selecting weight matrices to apply LoRA. Additionally, the observed rank-deficiency in adaptation matrices opens up new research possibilities, suggesting that the pre-trained weight matrices might also exhibit rank-deficiency, which could lead to more insights and enhancements in model training processes.
