The Breakthrough of 'Adam-mini': Revolutionizing AI Optimization
The advent of the Adam Optimizer marked a significant advancement in the training of contemporary neural networks.
Introduced in 2014, Adam has become the industry's preferred choice for training large language models (LLMs), dominating the landscape alongside its various iterations.
However, despite its impressive performance, Adam has a notable drawback: memory inefficiency.
For instance, training an LLM with 7 billion parameters demands approximately 86 GB of memory. For larger models such as Google's PaLM, with 540 billion parameters, more than 50 GPUs may be required solely to hold Adam's optimizer states.
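These figures line up with a rough back-of-the-envelope estimate (a simplifying assumption here is that the weights and both of Adam's moment estimates are stored in 32-bit precision, i.e., 4 bytes per value):

- 7B weights × 4 bytes ≈ 28 GB
- 7B first-moment estimates × 4 bytes ≈ 28 GB
- 7B second-moment estimates × 4 bytes ≈ 28 GB

That is roughly 84 GB before gradients and activations are even counted, and Adam's two moment buffers alone account for two-thirds of it.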
Fortunately, recent developments bring promising news!
A research team has unveiled a more efficient variant of Adam, termed Adam-mini.
This innovative optimizer roughly halves the memory footprint of AdamW and delivers a 49.6% throughput improvement over it when applied to billion-parameter LLMs.
This article delves deeply into the mechanics of optimizers, tracing their evolution, identifying their limitations, and illustrating how Adam-mini addresses these challenges, heralding a new era for deep learning.
What Are Optimizers?
An optimizer is a computational algorithm that fine-tunes an ML model's parameters (weights and biases) to minimize the Loss function, thereby improving the model's accuracy during training.
To grasp the functioning of modern optimizers, we must first understand the foundational algorithm known as Gradient Descent. Let's explore this concept.
Understanding Gradient Descent
Gradient Descent is the cornerstone mathematical optimization technique: it iteratively adjusts an ML model's parameters to minimize its loss function.
During training, it begins with an initial parameter set and computes the gradient of the loss function for each parameter.
(For those unfamiliar, the gradient is a vector of the first-order partial derivatives of the loss function with respect to the model parameters. It reveals the direction and rate of change of the loss function for each parameter.)
The optimizer then updates the model's parameters iteratively in the opposite direction of the gradient to decrease the loss function.
(This process can be visualized as navigating downhill along the slope of a mountain shaped by the loss function until reaching a valley.)
The algorithm can be mathematically articulated as follows:
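θ ← θ − η · ∇L(θ)

Here, θ denotes the model parameters, η the learning rate (step size), and ∇L(θ) the gradient of the loss function with respect to the parameters.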
Variants of Gradient Descent
The Gradient Descent algorithm has three variants, each differing in how much data is used to compute the gradient of the loss function (a minimal code sketch contrasting them follows this list):
- Batch Gradient Descent (BGD)
In this approach, the entire dataset is used to calculate the gradient of the loss function with respect to the parameters, resulting in a single parameter update per pass over the data.
This method ensures stable convergence but can be slow and memory-intensive for large datasets.
- Stochastic Gradient Descent (SGD)
This variant calculates the gradient using one training data point at a time for each parameter update.
While this method is faster, it introduces noise during gradient updates.
- Mini-Batch Gradient Descent (MBGD)
Here, a subset of the training data (mini-batch) is used to compute the gradient of the loss function for a single parameter update.
This method strikes a balance between BGD and SGD, mitigating their respective shortcomings, yet the approach still has limitations, including:
- Uniform learning rates across all parameter updates.
- Fixed learning rates that fail to adapt to the dataset's characteristics.
- The potential for the algorithm to get stuck in local minima for non-convex loss functions.
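To make the contrast concrete, here is a minimal Python sketch of the three variants; the linear-regression setup, batch size of 32, and learning rate are illustrative assumptions rather than anything prescribed above:

```python
import numpy as np

# Toy linear-regression problem (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1,000 samples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def gradient(w, X_batch, y_batch):
    """Gradient of the mean-squared-error loss with respect to w."""
    error = X_batch @ w - y_batch
    return 2 * X_batch.T @ error / len(y_batch)

w = np.zeros(5)
lr = 0.05

# 1) Batch Gradient Descent: one update per pass over the FULL dataset.
w -= lr * gradient(w, X, y)

# 2) Stochastic Gradient Descent: one update per individual sample.
i = rng.integers(len(y))
w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])

# 3) Mini-Batch Gradient Descent: one update per small random subset.
idx = rng.choice(len(y), size=32, replace=False)
w -= lr * gradient(w, X[idx], y[idx])
```

All three share the same update rule; they differ only in how much data feeds each gradient computation.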
The Rise of Modern Optimizers
Modern optimizers address the limitations found in Gradient Descent and its variants.
(It's important to note that alternative optimization techniques, such as Particle Swarm Optimization and Bayesian Optimization, exist but do not rely on Gradient Descent.)
Let's delve into some intermediate optimizers that paved the way for Adam and its adaptations (AdamW).
Momentum
Momentum enhances the Stochastic Gradient Descent (SGD) approach by mitigating its tendency to become trapped in local minima.
It introduces a velocity component that accumulates past gradients, providing inertia and ensuring consistent direction during updates.
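A common formulation (with momentum coefficient γ, typically around 0.9, and learning rate η) is:

v ← γ · v + η · ∇L(θ)
θ ← θ − v

Because v accumulates past gradients, directions that are consistent across steps get amplified, while oscillating components tend to cancel out.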
AdaGrad (Adaptive Gradient)
AdaGrad addresses SGD's limitation of using a fixed learning rate for all model parameters.
Instead, it adapts the learning rate based on parameter frequency, making larger updates for less frequent parameters and smaller updates for those that are updated frequently.
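In essence, each parameter accumulates its own running sum of squared gradients and divides its step size by the square root of that sum (ε is a small constant for numerical stability):

G ← G + ∇L(θ)²    (element-wise, one accumulator per parameter)
θ ← θ − η · ∇L(θ) / (√G + ε)

Because G only grows, the effective learning rate keeps shrinking over time, which is the weakness RMSProp addresses next.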
RMSProp (Root Mean Square Propagation)
RMSProp, like AdaGrad, adjusts the learning rate for each parameter based on historical gradients.
However, unlike AdaGrad, which accumulates all past squared gradients, RMSProp uses an exponentially decaying average of them, preventing the effective learning rate from shrinking toward zero.
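The update replaces AdaGrad's ever-growing sum with a decaying average E[g²] of squared gradients (decay rate ρ, typically 0.9):

E[g²] ← ρ · E[g²] + (1 − ρ) · ∇L(θ)²
θ ← θ − η · ∇L(θ) / (√E[g²] + ε)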
Introducing 'Adam' (Adaptive Moment Estimation)
The Adam Optimizer, introduced in 2014, has become fundamental to training most neural networks today.
It combines Momentum and RMSProp by monitoring the first and second moment estimates of the gradients.
- First Moment Estimate (Mean of the gradients)
This is akin to Momentum, where an exponentially decaying average of past gradients is maintained.
- Second Moment Estimate (Uncentered Variance of the gradients)
Similar to RMSProp, this maintains an exponentially decaying average of previous squared gradients.
These terms are corrected for bias, leading to the following update rule for Adam.
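With g_t denoting the gradient at step t, and the defaults β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸ from the original paper:

m_t = β1 · m_{t−1} + (1 − β1) · g_t
v_t = β2 · v_{t−1} + (1 − β2) · g_t²
m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)
θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)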
Advancing 'Adam' with 'AdamW'
In 2019, AdamW emerged as an enhancement to Adam.
This approach decouples the weight decay term from the gradient-based update: rather than adding an L2 penalty to the gradient (where Adam's adaptive scaling distorts it), the decay is applied directly to the weights within the parameter update rule.
This modification enables AdamW to achieve superior generalization performance compared to the standard Adam optimizer.
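Concretely, the difference can be summarized as follows (λ is the weight-decay coefficient):

- Adam with L2 regularization: the penalty λ · θ is added to the gradient, so it flows through m and v and ends up rescaled by the adaptive learning rate.
- AdamW: the decay is applied directly in the update, θ ← θ − η · (m̂ / (√v̂ + ε) + λ · θ), so every weight decays at the same relative rate regardless of its gradient history.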
Today, AdamW is widely adopted across the industry, with Meta's Llama family of LLMs being notably trained using it.
The Arrival of 'Adam-mini'
Despite its advantages, AdamW is not without limitations.
It is memory-intensive: storing the first- and second-moment estimates requires at least twice as much memory as the model's parameters themselves.
To address this, techniques like CPU offloading and sharding (splitting the model and optimizer states across multiple GPUs) have been employed, but they introduce communication overhead that can slow down training.
Thus, a new approach to enhance AdamW was proposed.
AdamW allocates individual learning rates to each parameter based on gradient moment estimates, resulting in a billion different learning rates for a billion-parameter model.
However, researchers began to question the necessity of unique learning rates for each parameter.
They discovered that the Hessian matrix of a Transformer exhibits a near-block-diagonal structure, in which dense diagonal blocks dominate the near-zero off-diagonal regions.
(For those unfamiliar, a Hessian matrix contains the second-order partial derivatives of the loss function with respect to the model parameters, capturing its curvature and aiding optimization.)
These dense sub-blocks consist of groups of closely related parameters, particularly in the Transformer's components—namely, Query, Key, Value, and MLP layers.
This insight led to the development of Adam-mini.
In this algorithm, model parameters are partitioned into blocks according to the Transformer's Hessian structure.
The Query and Key parameters are partitioned by attention head, while the default PyTorch partitioning (one block per parameter tensor) applies to all other parameters.
Then, a single learning rate is calculated for all parameters within each block by averaging the second moment estimates.
For instance, in a model with five parameters, AdamW assigns individual learning rates based on second moment estimates.
In contrast, Adam-mini assigns a single learning rate to parameters within each block, resulting in a significant reduction in the total number of learning rates.
A minimal sketch of this per-block second-moment calculation is shown below.
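The following PyTorch-style sketch illustrates the idea. It is not the official implementation: the block partitioning is simplified to one block per parameter tensor (the paper additionally splits Query and Key weights per attention head), and the hyperparameter values are just Adam's usual defaults.

```python
import torch

def adam_mini_step(params, states, step, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-mini-style update: element-wise momentum as in Adam,
    but only a single second-moment scalar per parameter block."""
    for p in params:
        if p.grad is None:
            continue
        g = p.grad
        state = states.setdefault(p, {"m": torch.zeros_like(p), "v": 0.0})
        # First moment: kept element-wise, exactly as in Adam/AdamW.
        state["m"].mul_(beta1).add_(g, alpha=1 - beta1)
        # Second moment: ONE scalar per block, from the mean of g^2.
        state["v"] = beta2 * state["v"] + (1 - beta2) * g.pow(2).mean().item()
        # Bias-correct both, then apply one shared scale to the whole block.
        m_hat = state["m"] / (1 - beta1 ** step)
        v_hat = state["v"] / (1 - beta2 ** step)
        p.data.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))
```

Because v is a single float per block instead of a tensor the size of the parameters, the memory spent on second-moment estimates collapses from one value per parameter to one value per block.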
Evaluating 'Adam-mini'
The performance of Adam-mini stands out across various metrics.
Memory Efficiency
Adam-mini eliminates more than 90% of the second-moment estimates used by various LLMs, yielding memory savings of 45% to 50% compared to AdamW.
Throughput
Adam-mini significantly diminishes communication between CPUs and GPUs, thanks to its memory efficiency.
Moreover, its update rules do not add extra computations; they merely average the second-moment estimates, which is a computationally inexpensive process.
This results in Adam-mini achieving 50% higher throughput than AdamW, reducing wall-clock time by 33% for pre-training Llama2–7B.
Performance During LLM Pre-Training
In tests involving open-source LLMs from the Llama and GPT-2 series, Adam-mini matches AdamW's performance while consuming less memory.
Additionally, Adam-mini is not sensitive to hyperparameters and keeps validation loss stable during training.
Performance on LLM Supervised Fine-Tuning
Supervised fine-tuning (SFT) of the pre-trained Llama2–7B model demonstrates that Adam-mini achieves better evaluation perplexity than AdamW while using less memory.
Similar results indicate higher evaluation rewards when Reinforcement Learning from Human Feedback (RLHF) is employed on the same LLM.
Furthermore, evaluations on the MT-Bench benchmark reveal that Adam-mini surpasses AdamW in every downstream task, showcasing its enhanced chat capabilities.
Performance on Non-LLM Tasks
Researchers have also adapted a version of Adam-mini for non-LLM neural networks.
This version organizes parameters into blocks based on layers or other logical divisions within the model.
Then, second-moment estimates within each block are averaged, with each block assigned a single learning rate.
Adam-mini demonstrates comparable or superior performance to AdamW on all popular non-LLM tasks.
These findings are revolutionary!
They have the potential to enable future researchers to train LLMs (and other deep neural network architectures) with fewer GPUs, thus lowering costs and energy consumption, accelerating ML research, and democratizing AI development.
What do you think of the Adam-mini optimizer? Have you had the chance to implement it in your projects? Share your thoughts in the comments below.
Further Reading
- Research paper titled ‘Adam-mini: Use Fewer Learning Rates To Gain More’ on ArXiv
- GitHub repository containing the implementation of the Adam-mini optimizer
- Research paper titled ‘Adam: A Method for Stochastic Optimization’ on ArXiv
- Research paper titled ‘Decoupled Weight Decay Regularization’ on ArXiv
- Research paper titled ‘Why Transformers Need Adam: A Hessian Perspective’ on ArXiv
- Research paper titled ‘An overview of gradient descent optimization algorithms’ on ArXiv
Stay Connected
Here are my mailing list links if you’d like to stay updated with my work:
- Subscribe to Dr. Ashish Bamania on Gumroad: [bamaniaashish.gumroad.com](https://bamaniaashish.gumroad.com)
- Byte Surgery, a deep dive into the best of software engineering: [bytesurgery.substack.com](https://bytesurgery.substack.com)
- Ashish's Substack: [ashishbamania.substack.com](https://ashishbamania.substack.com)
- Get an email whenever Dr. Ashish Bamania publishes on Medium: [bamania-ashish.medium.com](https://bamania-ashish.medium.com)