How to make LLMs more memory-efficient
Beyond PEFT: from Mixture of Experts to Quantum Tensor Networks
Today's post focuses on some recent methods to make LLMs more memory-efficient. The outline is as follows:
How to make LLMs more memory-efficient?
A bunch of interesting resources;
🚀 Job & Research opportunities, talks, and events in AI.
Let’s start!
How to make LLMs more memory-efficient?
Despite the enormous success of proprietary LLM-based AI assistants such as GPT-4o or Claude 3.5 Sonnet, most startups, companies and institutions have to run their own customised versions of open-source models such as Llama-3 70B or Mixtral 8x22B.
However, the immense size of high-performance models poses significant challenges: huge training and inference costs, high power requirements, and tight memory constraints for on-site deployment. To address this, many interesting recent approaches focus on compressing LLMs and improving their computational and memory efficiency.
The most popular paradigm, called Parameter-efficient Fine-tuning (PEFT), adapts LLMs by updating only a small number of parameters. For example, the well-known Low-Rank Adaptation (LoRA) adds a trainable low-rank update (the product of two small matrices) to the frozen pre-trained weights of each layer, drastically reducing the number of trainable parameters. However, there is often a performance gap between PEFT methods and full-parameter fine-tuning, with the former typically underperforming the latter.
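To make the idea concrete, below is a minimal PyTorch sketch of a LoRA-style linear layer; the class name, rank and scaling values are illustrative choices of mine, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze the pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Only these low-rank factors (r * (in + out) parameters) are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the low-rank correction
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Wrapping a 4096x4096 projection: ~65K of the ~16.8M parameters remain trainable
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```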
Below, I would like to share three interesting recent works that go beyond traditional LoRA methods (and their quantized version, QLoRA) and aim to close this accuracy gap by improving the efficiency of full fine-tuning:
Sparse Matrix Tuning, Haoze He et al.: the paper identifies the most significant sub-matrices of the gradient update and updates only these blocks during fine-tuning;
Gradient Low-Rank Projection (GaLore) by Beidi Chen, Anima Anandkumar, Yuandong Tian and collaborators: a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA (see the sketch after this list);
Representation Fine-Tuning by Stanford NLP Group: given that representations (the hidden states produced by the network layers) encode rich semantic information, the authors edit them instead of updating weights, operating on a frozen base model and learning task-specific interventions on hidden representations (GitHub repository).
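As a flavour of the second idea, here is a rough PyTorch sketch of gradient low-rank projection. The class name and the plain SGD-style update are my own simplifications; the actual GaLore method keeps Adam-style optimizer states in the low-rank subspace.

```python
import torch

class GaLoreProjector:
    """Keep optimizer work in a rank-r subspace of the gradient: updates are
    still applied to the full weight matrix, but the per-matrix optimizer
    memory shrinks from m*n to roughly r*n."""
    def __init__(self, rank: int = 4, update_gap: int = 200):
        self.rank, self.update_gap, self.step, self.P = rank, update_gap, 0, None

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        # Refresh the projector every `update_gap` steps from the top-r
        # left singular vectors of the current gradient
        if self.P is None or self.step % self.update_gap == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.P = U[:, : self.rank]              # (m, r)
        self.step += 1
        return self.P.T @ grad                      # (r, n) compressed gradient

    def project_back(self, low_rank_update: torch.Tensor) -> torch.Tensor:
        return self.P @ low_rank_update             # (m, n) full-size update

# Toy usage with a plain gradient step
W = torch.randn(1024, 1024, requires_grad=True)
proj = GaLoreProjector(rank=4)
loss = (W @ torch.randn(1024, 8)).pow(2).mean()
loss.backward()
with torch.no_grad():
    W -= 1e-3 * proj.project_back(proj.project(W.grad))
```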
Keeping up with the growing number of new methods is a challenge. That is why I recommend consulting this repository, Efficient LLMs by Mi Zhang's group, where the most important papers on memory- and energy-efficient LLMs are continuously collected. In addition, I would like to point out two other memory-friendly methods that cleverly exploit Mixtures of Experts (MoEs) and Quantum Tensor Networks:
MixLoRA: this approach combines the advantages of Mixture of Experts (MoE) and LoRA (a rough sketch of the idea follows this list). The authors claim to reduce GPU memory consumption by 41% and training latency by 17% (by Mingjie Tang's IDs Lab);
CompactifAI: Quantum-inspired Tensor Networks are used to compress LLMs. On top of quantization, the authors reduce the memory size of Llama 7B by 93% and the number of parameters by 70%, while accelerating training by 50% and inference by 25% (by Multiverse Computing).
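Since MixLoRA mixes the two ingredients discussed above, here is a rough sketch of an MoE-of-LoRA-experts layer: a frozen base projection shared by several LoRA experts, with a small router choosing which experts touch each token. Class and parameter names are mine, and for brevity every expert is evaluated densely, whereas a real implementation would only run the selected experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALayer(nn.Module):
    """Frozen base projection + several LoRA experts gated per token."""
    def __init__(self, base: nn.Linear, n_experts: int = 4, top_k: int = 2, r: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)       # frozen pre-trained weights
        self.router = nn.Linear(base.in_features, n_experts)
        self.A = nn.ParameterList([nn.Parameter(torch.randn(r, base.in_features) * 0.01)
                                   for _ in range(n_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(base.out_features, r))
                                   for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, in_features)
        gates = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        vals, idx = gates.topk(self.top_k, dim=-1)
        # Keep only each token's top-k gates, zero the rest
        sparse_gates = torch.zeros_like(gates).scatter(-1, idx, vals)
        out = self.base(x)                           # dense frozen path
        for e in range(len(self.A)):
            lora_out = x @ self.A[e].T @ self.B[e].T # expert e's low-rank update
            out = out + sparse_gates[:, e:e + 1] * lora_out
        return out

layer = MoELoRALayer(nn.Linear(512, 512))
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```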
To conclude, it is worth mentioning Berkeley's vLLM project, an open-source library for LLM inference and serving.
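If you want to try it, vLLM's offline inference API boils down to a few lines; the model name and sampling values below are only illustrative, so check the vLLM docs for the current interface.

```python
# Minimal vLLM offline-inference sketch (model id and parameters are illustrative)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain LoRA in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```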
Interesting resources
Here is a selection of valuable resources:
If you are an AI developer and want to get started with small-scale LLM projects, read this great blog-post series Yoga LLM & Yoga VLM by Vijayasri Iyer;
Another interesting blog series on LLMs: What We Learned from a Year of Building with LLMs by Eugene Yan and collaborators;
Often, especially when working on European-funded AI projects, you are asked to estimate the carbon footprint of the models you are delivering. ML CO2 Impact is an easy-to-use tool to answer that question.
Opportunities, talks, and events
I share some opportunities from my network that you might find interesting:
🚀 Job opportunities:
Pi School is looking for a legal intern to work on AI compliance for two European projects, mitigating risks and promoting innovation within a sound legal framework, in line with the AI Act (job details).
ContinualIST, a newly founded efficient-AI startup, will soon open several positions. Don't miss their website;
Agricola Moderna, an Italian vertical farming company, is looking for a Plant Scientist for their R&D department.
AI Freelancers: if you are looking for short/medium-term opportunities in Europe, fill in this form or share it with your network.
🔬 Research opportunities:
One PhD position in Quantum Analog Computing, and two Postdoc positions in Quantum Optimization at Pasqal and the University of Sherbrooke;
Postdoctoral position at European Space Agency - ESA Φ-lab in AI for Earth Observation and Hydrology. Check Nicolas Longépé’s LinkedIn post;
📚 Other opportunities or events:
Call for startups in Quantum Computing for Earth Observation (QC4EO) launched by ESA (check also Sabrina Ricci’s post);
Talk about Graph Theory for Orchestrating LLM Workflows by Ahmad Albarqawi at Pi School next week!
You can find me on LinkedIn or Twitter, where I share Science & Tech news. If you wish to book a call or view my lectures and courses (technical and non-technical), you can find me here.