If you work with artificial intelligence or are interested in training machine learning models, you have probably heard of LoRA (Low-Rank Adaptation).
This training technique is transforming the way we adapt large language models for specific tasks, offering an efficient and cost-effective alternative to traditional fine-tuning.
Fine-tuning is the process of adjusting a pretrained language model to a specific task or domain, such as medicine, law, or a particular language (like Portuguese). Instead of training a model from scratch, which is expensive and time-consuming, fine-tuning leverages the knowledge the model has already learned and adapts its behavior to the desired data and objectives, refining its parameters with a smaller, more specific dataset.
What is LoRA?
LoRA, or Low-Rank Adaptation, is a fine-tuning technique that allows adapting large language models (LLMs) for specific tasks without the need to retrain the entire model. Developed by researchers at Microsoft, this approach uses low-rank matrix decomposition to drastically reduce the number of trainable parameters.
Simply put, imagine you have a model with billions of parameters, but you only need to adjust a small part of it for your specific task. LoRA lets you do exactly that by keeping the original model intact and adding only small "adapters" that capture the necessary changes.
💡 Want to try it out on Colab? In the Notebooks section, there are two examples of how to fine-tune the LLaMA 3.2 model with LoRA on a custom dataset: one notebook using the Hugging Face PEFT library, and another using the Unsloth library.
How Does LoRA Work?
The magic of LoRA lies in its mathematical approach. When a neural model processes information, it multiplies input data by weight matrices. LoRA works as follows:
Freezing the original model: The weights of the pretrained model remain fixed, unchanged.
Low-rank decomposition: For each layer you want to adapt, LoRA adds two small matrices (A and B) that, when multiplied, approximate the needed changes to the original weights.
Efficient training: Only these small matrices are trained, drastically reducing the number of trainable parameters (a minimal sketch follows this list).
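To make the three steps concrete, here is a minimal, illustrative sketch in PyTorch of a single LoRA-adapted linear layer. The rank, scaling factor, and initialization below are simplified assumptions for demonstration, not the exact implementation used by PEFT or the original paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen base weights W0 plus a trainable low-rank correction B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        # 1. Freeze the original model: W0 (and bias) receive no gradients.
        for p in self.base.parameters():
            p.requires_grad_(False)
        # 2. Low-rank decomposition: A is (r x in), B is (out x r), so B @ A matches W0's shape.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the lightweight correction: W0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# 3. Only A and B are trainable: for a 4096x4096 layer with r=8, that is roughly 0.4% of the weights.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```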
It helps to contrast the two approaches to model adaptation. In traditional fine-tuning, the model's linear layer weights, represented by W₀, are updated directly on the new data, which requires rewriting and storing the entire weight matrix.
In LoRA, the weight matrix W₀ stays frozen, and the model learns two small matrices, A and B, which introduce a lightweight correction to the layer's output. This correction is added to the original result, so the model can be adapted with far fewer parameters, making the process lighter, more efficient, and easier to share.
Main Advantages of LoRA
Resource Efficiency: LoRA cuts the number of trainable parameters dramatically (the original paper reports up to 10,000× fewer trainable parameters for GPT-3), making fine-tuning accessible even without high-performance hardware.
Training Speed: With fewer parameters to update, training time drops significantly. In addition, the LoRA matrices can be merged into the base model weights before inference, so the adapted model adds no extra latency.
Flexibility: A single base model can serve multiple tasks by swapping only the small LoRA modules; in other words, we can train several LoRA adapters for different tasks without ever changing the base model (see the sketch after this list).
Competitive Performance: According to the authors, on benchmarks such as GLUE, SAMSum, and WikiSQL, models adapted with LoRA match, and in some cases exceed, the performance of fully fine-tuned models.
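To illustrate the flexibility and merging points above, here is a hedged sketch using the Hugging Face PEFT library. The base model id and the adapter repository names are placeholders, not real checkpoints.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One frozen base model can serve several tasks; the adapter paths below are placeholders.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Attach a first LoRA adapter, then register a second one under its own name.
model = PeftModel.from_pretrained(base, "my-org/llama-lora-summarization", adapter_name="summarize")
model.load_adapter("my-org/llama-lora-sql", adapter_name="sql")

model.set_adapter("summarize")   # route generation through the summarization adapter
model.set_adapter("sql")         # ...or switch tasks without reloading the base model

# Optionally fold the active adapter's B·A update into W0 for zero-overhead inference.
merged_model = model.merge_and_unload()
```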
Why Should You Use LoRA?
Imagine needing custom models for several different tasks. With traditional fine-tuning, you would have to store and serve a full copy of the model for each task, which quickly becomes impractical for today's very large models.
With LoRA, you store only small task-specific matrices (for example, 0.01% of the original model size), making it feasible to train and deploy various custom models at low cost.
Moreover, since most parameters remain frozen, the memory needed for gradients and optimizer states shrinks accordingly (see the rough comparison below).
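As a back-of-the-envelope illustration of those savings, the numbers below are purely illustrative assumptions: fp16 weights, Adam optimizer states kept in fp32, an ~8B-parameter base model, and an adapter of ~20M parameters.

```python
# Illustrative assumptions: fp16 weights, Adam states in fp32, ~8B base model, ~20M-parameter adapter.
full_model_params = 8_000_000_000
lora_params = 20_000_000

full_checkpoint_gb = full_model_params * 2 / 1e9   # one fp16 copy of the model per task
lora_checkpoint_gb = lora_params * 2 / 1e9         # one small adapter per task

adam_states_full_gb = full_model_params * 8 / 1e9  # first + second moments in fp32
adam_states_lora_gb = lora_params * 8 / 1e9        # only adapter params get optimizer states

print(f"per-task checkpoint: {full_checkpoint_gb:.1f} GB vs {lora_checkpoint_gb:.2f} GB")
print(f"Adam optimizer states: {adam_states_full_gb:.1f} GB vs {adam_states_lora_gb:.2f} GB")
```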
Practical Implementation
Implementing LoRA has become much more accessible with libraries such as PEFT (Parameter-Efficient Fine-Tuning) from Hugging Face and Unsloth, which make parameter-efficient tuning of large models straightforward. The typical process involves the steps below, followed by a short code sketch:
Base Model Selection: Choosing a suitable pretrained model
LoRA Configuration: Defining which layers to adapt and the matrix rank
Data Preparation: Organizing task-specific training data
Training: Fine-tuning only the LoRA adapters
Validation: Testing performance on the specific domain or task
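Putting those five steps together, a minimal sketch with the Hugging Face PEFT library might look like the following. The model id, target modules, and rank are illustrative choices rather than recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# 1. Base model selection (the model id here is just an example)
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 2. LoRA configuration: which layers to adapt and the rank of A and B
lora_config = LoraConfig(
    r=8,                                   # rank of the update matrices
    lora_alpha=16,                         # scaling factor applied to B·A
    target_modules=["q_proj", "v_proj"],   # attention projections are a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# 3-4. Wrap the frozen base model with trainable adapters, then train as usual
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...prepare your task-specific dataset and pass `model` to transformers.Trainer or TRL's SFTTrainer...

# 5. After training, save only the adapter weights for validation and deployment
model.save_pretrained("llama-3.2-lora-adapter")
```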
Conclusion
LoRA represents a paradigm shift in adapting language models. By exploiting the low-rank structure of updates, this technique offers efficiency without sacrificing quality, making it an ideal choice for those working with LLMs in production, research, or resource-constrained projects.
For developers, researchers, and companies seeking to customize AI models without the prohibitive costs of traditional fine-tuning, LoRA offers an efficient and practical path. As the technology continues evolving, we can expect it to become more integrated into the development of personalized AI solutions.
The technique keeps evolving, with variations such as AdaLoRA (Adaptive LoRA) and QLoRA (Quantized LoRA, sketched briefly below) promising even greater efficiency. The trend is for LoRA to become the standard for model adaptation in resource-constrained scenarios.
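As a taste of the QLoRA variant, here is a minimal sketch assuming the bitsandbytes-backed 4-bit loading path in transformers: the frozen base model is loaded in 4-bit precision while the LoRA adapters are trained in higher precision. The model id and LoRA hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model quantized to 4-bit (NF4), keeping compute in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", quantization_config=bnb_config)

# Prepare the quantized model for training and attach LoRA adapters on top.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
```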
Paper: LoRA: Low-Rank Adaptation of Large Language Models
GitHub: https://github.com/microsoft/LoRA