SGD
Stochastic gradient descent (SGD) is a basic gradient descent optimizer that minimizes a loss function by updating the model parameters in the direction opposite to the gradient. Each update is computed on a randomly sampled mini-batch of data from the dataset.
bitsandbytes also supports momentum and Nesterov momentum to accelerate SGD by adding a weighted average of past gradients to the current gradient.
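To make the momentum and Nesterov terms concrete, here is a minimal pure-Python sketch of one update step. It follows the textbook formulation used by torch.optim.SGD and is only an illustration: the function name `sgd_momentum_step` is hypothetical, and the sketch assumes nothing about how the bitsandbytes kernels actually compute the update.

```python
def sgd_momentum_step(params, grads, bufs, lr, momentum=0.9,
                      dampening=0.0, weight_decay=0.0, nesterov=False):
    """One SGD step over flat lists of floats (torch.optim.SGD convention).

    bufs holds one momentum buffer per parameter; None means "not yet
    initialized". Returns the updated (params, bufs) as new lists.
    """
    new_params, new_bufs = [], []
    for p, g, buf in zip(params, grads, bufs):
        g = g + weight_decay * p              # L2 penalty folded into the gradient
        if buf is None:
            buf = g                           # first step: buffer starts at the gradient
        else:
            buf = momentum * buf + (1.0 - dampening) * g
        # Nesterov looks ahead along the buffer; plain momentum uses the buffer itself.
        step = g + momentum * buf if nesterov else buf
        new_params.append(p - lr * step)
        new_bufs.append(buf)
    return new_params, new_bufs


# Two steps on f(p) = p**2, whose gradient is 2 * p.
params, bufs = [1.0], [None]
for _ in range(2):
    grads = [2.0 * p for p in params]
    params, bufs = sgd_momentum_step(params, grads, bufs, lr=0.1, momentum=0.9)
print(params)
```

The second step is larger than plain SGD would take because the buffer still carries most of the first gradient, which is the acceleration effect described above.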
SGD
class bitsandbytes.optim.SGD
( params, lr, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True )
__init__
( params, lr, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True )
Parameters
- params (torch.tensor): The input parameters to optimize.
- lr (float): The learning rate.
- momentum (float, defaults to 0): The momentum value speeds up the optimizer by taking bigger steps.
- dampening (float, defaults to 0): The dampening value reduces the momentum of the optimizer.
- weight_decay (float, defaults to 0.0): The weight decay value for the optimizer.
- nesterov (bool, defaults to False): Whether to use Nesterov momentum.
- optim_bits (int, defaults to 32): The number of bits of the optimizer state.
- args (object, defaults to None): An object with additional arguments.
- min_8bit_size (int, defaults to 4096): The minimum number of elements of the parameter tensors for 8-bit optimization.
- percentile_clipping (int, defaults to 100): Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
- block_wise (bool, defaults to True): Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
Base SGD optimizer.
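The percentile_clipping parameter is easier to follow with a small example. The following is a rough pure-Python sketch of the idea described above (track recent gradient norms, clip at a chosen percentile), not the library's implementation. The helper name `percentile_clip_scale`, the nearest-rank percentile rule, and the treatment of 100 as "no clipping" are all assumptions made for illustration.

```python
import math


def percentile_clip_scale(grad, norm_history, percentile=100, window=100):
    """Return (scale, new_history) for one step of percentile clipping.

    grad is a flat list of floats; norm_history is the list of recent
    gradient norms. Multiply the gradient by `scale` before the update.
    """
    gnorm = math.sqrt(sum(g * g for g in grad))
    history = (norm_history + [gnorm])[-window:]   # keep only the last `window` norms
    if percentile >= 100:
        return 1.0, history                        # assumed: 100 leaves gradients untouched
    ranked = sorted(history)
    # Nearest-rank percentile of the tracked norms (an assumption of this sketch).
    idx = max(0, math.ceil(percentile / 100 * len(ranked)) - 1)
    threshold = ranked[idx]
    scale = threshold / gnorm if gnorm > threshold else 1.0
    return scale, history


# Nine quiet steps followed by a gradient spike with norm 5.
history = [1.0] * 9
scale, history = percentile_clip_scale([3.0, 4.0], history, percentile=50)
clipped = [g * scale for g in [3.0, 4.0]]
```

In this sketch the spike is rescaled so that its norm matches the median of the recent history, which is how an adaptive threshold can damp occasional outlier gradients without a hand-tuned clipping constant.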
SGD8bit
class bitsandbytes.optim.SGD8bit
( params, lr, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True )
__init__
( params, lr, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True )
Parameters
- params (torch.tensor): The input parameters to optimize.
- lr (float): The learning rate.
- momentum (float, defaults to 0): The momentum value speeds up the optimizer by taking bigger steps.
- dampening (float, defaults to 0): The dampening value reduces the momentum of the optimizer.
- weight_decay (float, defaults to 0.0): The weight decay value for the optimizer.
- nesterov (bool, defaults to False): Whether to use Nesterov momentum.
- args (object, defaults to None): An object with additional arguments.
- min_8bit_size (int, defaults to 4096): The minimum number of elements of the parameter tensors for 8-bit optimization.
- percentile_clipping (int, defaults to 100): Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
- block_wise (bool, defaults to True): Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
8-bit SGD optimizer.
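The block_wise option can be illustrated with a toy quantizer. The sketch below uses simple linear absmax scaling to signed 8-bit codes, which is a deliberate simplification and an assumption of this example; it is not the quantization scheme bitsandbytes itself uses. The function names and the tiny block size are hypothetical and chosen only to keep the example readable.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize a flat list of floats to int8 codes, one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0   # all-zero block: avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map int8 codes back to floats using the scale of the block each code came from."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# Small optimizer-state values next to one large outlier (100.0).
state = [0.01, -0.02, 0.03, 0.015, 100.0, 0.5, -0.25, 0.1]

# Block-wise: the outlier only affects the scale of its own block.
codes, scales = quantize_blockwise(state, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# Single block: the outlier sets the scale for every value.
codes_one, scales_one = quantize_blockwise(state, block_size=8)
```

With a single block, the first four values all collapse to code 0 because the outlier dominates the scale. With two blocks they keep distinct codes, which is the outlier isolation that the block_wise description refers to.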
SGD32bit
class bitsandbytes.optim.SGD32bit
( params, lr, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True )
__init__
( params, lr, momentum = 0, dampening = 0, weight_decay = 0, nesterov = False, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True )
Parameters
- params (torch.tensor): The input parameters to optimize.
- lr (float): The learning rate.
- momentum (float, defaults to 0): The momentum value speeds up the optimizer by taking bigger steps.
- dampening (float, defaults to 0): The dampening value reduces the momentum of the optimizer.
- weight_decay (float, defaults to 0.0): The weight decay value for the optimizer.
- nesterov (bool, defaults to False): Whether to use Nesterov momentum.
- args (object, defaults to None): An object with additional arguments.
- min_8bit_size (int, defaults to 4096): The minimum number of elements of the parameter tensors for 8-bit optimization.
- percentile_clipping (int, defaults to 100): Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
- block_wise (bool, defaults to True): Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
32-bit SGD optimizer.
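As a usage sketch, the helper below chooses between the SGD8bit and SGD32bit classes documented on this page. `make_sgd` is a hypothetical convenience function, not part of bitsandbytes, and it assumes bitsandbytes is installed with a supported device. It passes a non-zero momentum by default because some bitsandbytes releases appear to reject momentum = 0; treat that as an assumption to check against your installed version.

```python
def make_sgd(params, lr, bits=32, momentum=0.9, **kwargs):
    """Build a bitsandbytes SGD optimizer with 8-bit or 32-bit state.

    Hypothetical helper for illustration only. Extra keyword arguments
    (for example weight_decay or percentile_clipping) are forwarded
    unchanged to the optimizer constructor.
    """
    if bits not in (8, 32):
        raise ValueError(f"bits must be 8 or 32, got {bits}")
    # Imported lazily so the argument check above works without the library.
    import bitsandbytes as bnb

    optimizer_cls = bnb.optim.SGD8bit if bits == 8 else bnb.optim.SGD32bit
    return optimizer_cls(params, lr=lr, momentum=momentum, **kwargs)
```

A typical call would look like `optimizer = make_sgd(model.parameters(), lr=0.01, bits=8)`, followed by the usual `loss.backward()`, `optimizer.step()`, and `optimizer.zero_grad()` loop. Passing optim_bits to the generic SGD class is the other route suggested by the signatures on this page.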