LAMB
LAMB (Layerwise adaptive large batch optimization) is an adaptive optimizer designed to accelerate training with large batch sizes. It combines ideas from LARS and Adam to automatically scale the learning rate for each layer:
- it computes a trust ratio between the weight norm and the update norm of each layer, and clips the ratio to prevent overly large or small updates
- it updates the weights using Adam-style first and second moments
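The two steps above can be sketched in plain Python for a single layer (an illustrative sketch only, not the bitsandbytes implementation; the function name, list-based tensors, and the decoupled weight-decay placement are assumptions):

```python
import math

def lamb_update(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-8, weight_decay=0.0, max_unorm=1.0):
    """One LAMB step for a single layer, with weights stored as a list.

    Returns the updated (w, m, v). `max_unorm` caps the trust ratio so
    no layer takes an overly large step.
    """
    # Adam-style first and second moment updates with bias correction.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    # Adam update direction, plus decoupled weight decay (AdamW mode).
    u = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
         for mh, vh, wi in zip(m_hat, v_hat, w)]

    # Layer-wise trust ratio: ||w|| / ||u||, clipped to max_unorm.
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in u))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    trust = min(trust, max_unorm)

    return [wi - lr * trust * ui for wi, ui in zip(w, u)], m, v
```

The trust ratio rescales the same Adam direction per layer, which is what lets LAMB stay stable at very large batch sizes.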
LAMB
class bitsandbytes.optim.LAMB
( params, lr = 0.001, bias_correction = True, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, adam_w_mode = True, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = False, max_unorm = 1.0 )
__init__
( params, lr = 0.001, bias_correction = True, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, adam_w_mode = True, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = False, max_unorm = 1.0 )
Parameters
- params (`torch.tensor`) — The input parameters to optimize.
- lr (`float`, defaults to 1e-3) — The learning rate.
- bias_correction (`bool`, defaults to `True`) — Whether to apply bias correction to the first and second-order moments.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) — The decay rates of the first and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) — The epsilon value, which prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0) — The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to `False`) — Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- adam_w_mode (`bool`, defaults to `True`) — Whether to use the AdamW variant.
- optim_bits (`int`, defaults to 32) — The number of bits of the optimizer state.
- args (`object`, defaults to `None`) — An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) — The minimum number of elements a parameter tensor needs for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) — Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to `False`) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- max_unorm (`float`, defaults to 1.0) — The maximum gradient norm.
Base LAMB optimizer.
LAMB8bit
class bitsandbytes.optim.LAMB8bit
( params, lr = 0.001, bias_correction = True, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, adam_w_mode = True, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = False, max_unorm = 1.0 )
__init__
( params, lr = 0.001, bias_correction = True, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, adam_w_mode = True, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = False, max_unorm = 1.0 )
Parameters
- params (`torch.tensor`) — The input parameters to optimize.
- lr (`float`, defaults to 1e-3) — The learning rate.
- bias_correction (`bool`, defaults to `True`) — Whether to apply bias correction to the first and second-order moments.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) — The decay rates of the first and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) — The epsilon value, which prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0) — The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to `False`) — Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- adam_w_mode (`bool`, defaults to `True`) — Whether to use the AdamW variant.
- args (`object`, defaults to `None`) — An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) — The minimum number of elements a parameter tensor needs for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) — Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to `False`) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- max_unorm (`float`, defaults to 1.0) — The maximum gradient norm.
8-bit LAMB optimizer.
LAMB32bit
class bitsandbytes.optim.LAMB32bit
( params, lr = 0.001, bias_correction = True, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, adam_w_mode = True, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = False, max_unorm = 1.0 )
__init__
( params, lr = 0.001, bias_correction = True, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, adam_w_mode = True, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = False, max_unorm = 1.0 )
Parameters
- params (`torch.tensor`) — The input parameters to optimize.
- lr (`float`, defaults to 1e-3) — The learning rate.
- bias_correction (`bool`, defaults to `True`) — Whether to apply bias correction to the first and second-order moments.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) — The decay rates of the first and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) — The epsilon value, which prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0) — The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to `False`) — Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- adam_w_mode (`bool`, defaults to `True`) — Whether to use the AdamW variant.
- args (`object`, defaults to `None`) — An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) — The minimum number of elements a parameter tensor needs for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) — Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to `False`) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- max_unorm (`float`, defaults to 1.0) — The maximum gradient norm.
32-bit LAMB optimizer.