FSDP vs DeepSpeed

Accelerate offers flexibilty of training frameworks, by integrating two extremely powerful tools for distributed training, namely Pytorch FSDP and Microsoft DeepSpeed. The aim of this tutorial is to draw parallels, as well as to outline potential differences, to empower the user to switch seamlessly between these two frameworks.

To switch between the frameworks, we recommend launching code accelerate launch passing in the correct config file with --config_file, or passing in the respective arguments directly for FSDP and DeepSpeed .

Example Accelerate configurations can be found here for DeepSpeed and FSDP, or in the example zoo under “Launch Configurations”

This tutorial is for single-node, multi-GPU, scenarios only.

Configuring Functionalities

Model tensors are split into different GPUs in an attempt to scale up model sizes; this is termed sharding in FSDP, and partitioning in DeepSpeed. FSDP sharding and DeepSpeed ZeRO (partitioning) stages are configured by --fsdp_sharding_strategy, and --zero_stage, respectively. In particular, FSDP FULL_SHARD maps to DeepSpeed ZeRO stage 3; see this comprehensive mapping between FSDP sharding and DeepSpeed ZeRO settings. The below table summarizes and groups similar settings:

Group	Framework	Configuration	Example	Restrictions (if any)
sharding / partitioning	FSDP DeepSpeed	`--fsdp_sharding_strategy` `--zero_stage`	`1` (`FULL_SHARD`) `3`
offload	FSDP DeepSpeed	`--fsdp_offload_params` `--offload_param_device` `--offload_optimizer_device`	`true` `cpu` `cpu`	all or nothing
model loading	FSDP DeepSpeed	`--fsdp_cpu_ram_efficient_loading` `--zero3_init_flag`	`true` `true`	only ZeRO 3
efficient checkpointing	FSDP DeepSpeed	`--fsdp_state_dict_type` `--zero3_save_16bit_model`	`SHARDED_STATE_DICT` `true`	only ZeRO 3
weights prefetching	FSDP DeepSpeed	`--fsdp_forward_prefetch` `--fsdp_backward_prefetch` None	`true` `BACKWARD_PRE`
model	FSDP DeepSpeed	`--fsdp_auto_wrap_policy` `--fsdp_transformer_layer_cls_to_wrap` None	`TRANSFORMER_BASED_WRAP` <Layer Class>	Usually not needed Transparent to user.
parameters summoning	FSDP DeepSpeed	`--fsdp_use_orig_params` None	`true`	required for `torch.compile` Transparent to user
parameters syncing	FSDP DeepSpeed	`--fsdp_sync_module_states` None	`true`
training	FSDP DeepSpeed	None `--gradient_accumulation_steps` `--gradient_clipping`	`auto` `auto`	Transparent to user

For detailed descriptions of the above, refer to Accelerate launch documentation.

To access other DeepSpeed configurations, such as mixed precision settings, you need to pass in a --deepspeed_config_file, see the documentation.

DeepSpeed can be also configured via DeepSpeedPlugin, e.g., DeepSpeedPlugin.zero_stage is equivalent of --zero_stage, and DeepSpeedPlugin.hf_ds_config can be used to pass --deepeed_config_file.

FSDP can be also configured via FullyShardedDataParallelPlugin, e.g., FullyShardedDataParallelPlugin.sharding_strategy is equivalent of --fsdp_sharding_strategy.

Checkpointing

Do note that while FSDP can be configured via --fsdp_state_dict_type to save either full / sharded checkpoints.

For DeepSpeed Zero3, one could pass a --zero3_save_16bit_model true, which conveniently consolidates the model to a single rank and saves; this is the FSDP equivalent of fsdp_state_dict_type: FULL_STATE_DICT.

For large models, consolidating the model to a single rank can be very slow.

For quicker checkpointing, for FSDP use fsdp_state_dict_type: SHARDED_STATE_DICT, and for DeepSpeed Zero3 use the zero_to_fp32.py script to post-convert sharded checkpoints.

Offloading

FSDP only allows all-or-nothing offload (i.e., either offload parameters, gradients, and optimizer, or keep them all in GPU), but DeepSpeed can offload parameters and optimizer differently. Furthermore, DeepSpeed also supports offloading to NVME.

Prefetching

FSDP allows two prefetching configurations --fsdp_forward_prefetch and --fsdp_backward_prefetch to improve overlap of comms / computation at a cost of extra memory, see FSDP documentation. For DeepSpeed, the prefetching will be turned on when needed, and it turns on depending on certain hyper-params like stage3_param_persistence_threshold, stage3_max_reuse_distance, etc, that can be configured for Zero3; accelerate may set these hyper-params automatically if you don’t set those explicitly in the deepspeed config file.

For FSDP set fsdp_backward_prefetch: BACKWARD_PRE for improved throughputs if memory allows.

Model Loading

While FSDP require an explicit --fsdp_cpu_ram_efficient_loading true to activate efficient model loading, transformers will activate the similar feature whenever DeepSpeed Zero3 is used.

For FSDP, whenever setting --fsdp_cpu_ram_efficient_loading true, accelerate will automatically set sync_module_states to true. For RAM efficient loading the weights will be loaded only in a singe rank, and thus requires sync_module_states to broadcast weights to other ranks.

Model

FSDP requires an explicit --fsdp_auto_wrap_policy for the algorithm to decide how to schedule the all-gather and reduce-scatter operations. But for DeepSpeed this is transparent to the user.

For FSDP, simply set fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP. With the latest transformers versions, we try our best to figure out the suitable fsdp_transformer_layer_cls_to_wrap for HF transformers models. However, if you get an error regarding it, please specify this.

Parameters Summoning

FSDP requires an explicit --fsdp_use_orig_params flag if using torch.compile, see the pytorch documenation. For DeepSpeed this is transparent to the user.

For FSDP, when using torch.compile please set fsdp_use_orig_params: True.

Training

Deepspeed requires explicit --gradient_accumulation_steps and --gradient_clipping flags. For FSDP this is transparent to the user.

When using DeepSpeed, set gradient_accumulation_steps: "auto" and gradient_clipping: "auto" to automatically pick up values set in the Accelerator or TrainingArguments (if using transformers).

On Differences in Data Precision Handling

To discuss the how data precision is handled in both FSDP and Deepspeed, it is instructive to first give an overview of how model parameters are handled in these frameworks. Before the model / optimizer parameters are distributed across GPUs, parameter preparation is involved to first “flatten” them to one-dimensional torch.Tensor. The implementation of FSDP / DeepSpeed varies in the respect of the dtype in which these “flattened” parameters are stored, and there are ramifications with regards to how torch.Optimizer allocate their dtypes. The table below outlines the processes for both frameworks; the “Local” column indicates the process occurring at a per-gpu level, therefore any memory overheads by upcasting should be understood to be amortized by the number of gpus used.

As a rule of thumb, for stable training with automatic mixed precision, all the trainable parameters have to be in torch.float32.

Process	Local	Framework	Details
Loading, i.e., `AutoModel.from_pretrained(..., torch_dtype=torch_dtype)`
Preparation, i.e., creation of “flat params”	✅	FSDP DeepSpeed	created in `torch_dtype`. disregards `torch_dtype`, created in `float32`.
Optimizer initialization	✅	FSDP DeepSpeed	creates parameters in `torch_dtype` creates parameters in `float32`
Training Step, i.e, forward, backward, reduction		FSDP DeepSpeed	follows `MixedPrecision` follows `deepspeed_config_file` mixed precision settings.
Optimizer (Pre-Step)	✅	FSDP DeepSpeed	upcasting (if any) to `torch_dtype` upcasted to `float32`
Optimizer (Actual Step)	✅	FSDP DeepSpeed	occurs in `torch_dtype` occurs in `float32`.

Therefore when using DeepSpeed a small number of GPUs, be aware of potentially significant memory overheads due to the upcasting during preperation.

With FSDP, in the absence of mixed precision, it is possible to operate the torch.Optimizer in low precision torch_dtype, which may be helpful when using small number of GPUs.

With mixed precision, FSDP and DeepSpeed will upcast in the model preparation step (c.f. table above). But do note that FSDP will then save checkpoints in the upcasted precision; Deepspeed may still save low precision checkpoints if --zero3_save_16bit_model is specified.

To clarify the above table consider the concrete examples below; the optimizer pre- and actual step combined for brevity. With FSDP it is possible to operate in the two modes shown below, but DeepSpeed can only operate in one.

Framework	Model Loading (`torch_dtype`)	Mixed Precision	Preparation (Local)	Training	Optimizer (Local)
FSDP	bf16	default (none)	bf16	bf16	bf16
FSDP	bf16	bf16	fp32	bf16	fp32
DeepSpeed	bf16	bf16	fp32	bf16	fp32

< > Update on GitHub