Efficient Training on Multiple GPUs
If training on a single GPU is too slow or the model weights do not fit in a single GPU's memory, we use a multi-GPU setup. Switching from a single GPU to multiple GPUs requires some form of parallelism, as the workload must be distributed. Several techniques can be used to achieve parallelism, such as data, tensor, or pipeline parallelism. However, there is no one-size-fits-all solution, and the optimal settings depend on the hardware you are using. This article focuses on a PyTorch-based implementation, while highlighting the main concepts, most of which likely apply to other frameworks as well.
Note: Many of the strategies introduced in the single GPU section (such as mixed precision training or gradient accumulation) are generic and apply to training models in general, so make sure to have a look at them before diving into the following sections on multi-GPU or CPU training.
We will first discuss various 1D parallelism techniques and their pros and cons in detail, and then look at how they can be combined into 2D and 3D parallelism to enable even faster training and to support even larger models. Various other powerful alternative approaches will be introduced as well.
Concepts
What follows is a brief description of the main concepts that will be described in more detail later in this document.
- DataParallel (DP) - the same setup is replicated multiple times, and each replica is fed a slice of the data. The processing is done in parallel, and all setups are synchronized at the end of each training step.
- TensorParallel (TP) - each tensor is split up into multiple chunks, so instead of having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. During processing, each shard gets processed separately and in parallel on different GPUs, and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on the horizontal level.
- PipelineParallel (PP) - the model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel and works on a small chunk of the batch.
- Zero Redundancy Optimizer (ZeRO) - performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
- Sharded DDP - another name for the foundational ZeRO concept as used by various ZeRO implementations.
Before diving deeper into the specifics of each concept, let's have a look at the rough decision process for training huge models on a large infrastructure.
Scalability Strategy
⇚ Single Node / Multi-GPU
If the model fits onto a single GPU:
- DDP - Distributed Data Parallel
- ZeRO - may or may not be faster depending on the situation and configuration used
If the model doesn't fit onto a single GPU:
- PP
- ZeRO
- TP
With very fast intra-node connectivity such as NVLINK or NVSwitch, all three should be mostly on par; without those, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It is best to experiment to find the winner on your particular setup.
TP is almost always used within a single node, i.e. TP size <= GPUs per node.
If the largest layer doesn't fit onto a single GPU:
- If not using ZeRO - you must use TP, as PP alone won't be able to fit it.
- If using ZeRO - see the same entry under "Single GPU".
⇚ Multi-Node / Multi-GPU
If you have fast inter-node connectivity:
- ZeRO - requires close to no modifications to the model
- PP+TP+DP - less communication, but requires massive changes to the model
If you have slow inter-node connectivity and are still low on GPU memory:
- DP+PP+TP+ZeRO-1
Data Parallelism
Most users with just 2 GPUs already enjoy the increased training speed that `DataParallel` (DP) and `DistributedDataParallel` (DDP) deliver. These are built-in features of PyTorch that are almost trivial to use. In general, it is recommended to use DDP, as it works for all models, whereas DP may fail for some of them. The PyTorch documentation itself recommends the use of DDP.
DP vs DDP
`DistributedDataParallel` (DDP) is typically faster than `DataParallel` (DP), but it is not always the case:
- While DP is Python-threads-based, DDP is multiprocess-based, and as such it has no Python-thread limitations such as the GIL (Global Interpreter Lock).
- On the other hand, slow interconnectivity between the GPU cards may lead to an actually slower outcome with DDP.
Here are the main differences in inter-GPU communication between the two modes:

DDP:
- At start time, the main process replicates the model from GPU 0 to the other GPUs.
- Then for each batch:
  - Each GPU directly consumes its mini-batch of data.
  - During `backward`, once the local gradients are ready, they are averaged across all processes.

DP:
For each batch:
- GPU 0 reads the batch of data and then sends a mini-batch to each GPU.
- The up-to-date model is replicated from GPU 0 to each GPU.
- `forward` is run, and the output from each GPU is sent to GPU 0 so the loss can be computed.
- The loss is distributed from GPU 0 to all GPUs, and `backward` is run.
- The gradients from each GPU are sent to GPU 0 and averaged.
The only communication DDP performs per batch is sending gradients, whereas DP does five different data exchanges per batch.
DP copies data within the process via Python threads, whereas DDP copies data via `torch.distributed`.
Under DP, GPU 0 performs much more work than the other GPUs, resulting in the GPUs being under-utilized.
DDP can be used across multiple machines, but this is not the case with DP.
There are other differences between DP and DDP, but they are not relevant to this discussion.
If you want to go really deep into understanding these two modes, this article is highly recommended, as it has great diagrams, includes multiple benchmarks and profiler outputs on various hardware, and explains all the subtle nuances that you may need to know.
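For reference, here is a minimal self-contained DDP training sketch (toy model, random data; all names and hyperparameters are placeholders for illustration). It would be launched with something like `torchrun --nproc_per_node 2 ddp_example.py`, so that each spawned process drives one GPU:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # toy model; DDP broadcasts the weights from rank 0 at construction time
    model = torch.nn.Linear(10, 10).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        inputs = torch.randn(4, 10, device=local_rank)  # each rank consumes its own mini-batch
        loss = model(inputs).sum()
        loss.backward()  # local gradients are averaged across all processes here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```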
Let's look at an actual benchmark:
Type | NVlink | Time
---|---|---
2:DP | Y | 110s
2:DDP | Y | 101s
2:DDP | N | 131s
Analysis:
Here DP is about 10% slower than DDP with NVlink, but about 15% faster than DDP without NVlink.
The real difference will depend on how much data each GPU needs to sync with the others - the more there is to sync, the more a slow link will slow down the total runtime.
Here is the full benchmark code and outputs. `NCCL_P2P_DISABLE=1` was used to disable the NVLink feature in the corresponding benchmark.
# DP
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
{'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69}
# DDP w/ NVlink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}
# DDP w/o NVlink
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
Hardware: 2x TITAN RTX, 24GB each + 2 NVLinks (`NV2` in `nvidia-smi topo -m`)
Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`
ZeRO Data Parallelism
ZeRO-powered data parallelism (ZeRO-DP) is illustrated in the diagram from the following blog post.
It can be difficult to wrap one's head around it, but in reality the concept is quite simple. This is just the usual `DataParallel` (DP), except that instead of replicating the full model parameters, gradients and optimizer states, each GPU stores only a slice of them. Then at run-time, when the full layer parameters are needed for a given layer, all GPUs synchronize to give each other the parts that they are missing - that is all there is to it.
Consider this simple model with 3 layers, where each layer has 3 parameters:
La | Lb | Lc
---|----|---
a0 | b0 | c0
a1 | b1 | c1
a2 | b2 | c2
Layer La has the weights a0, a1 and a2.
If we have 3 GPUs, Sharded DDP (= ZeRO-DP) splits the model onto the 3 GPUs like so:
GPU0:
La | Lb | Lc
---|----|---
a0 | b0 | c0
GPU1:
La | Lb | Lc
---|----|---
a1 | b1 | c1
GPU2:
La | Lb | Lc
---|----|---
a2 | b2 | c2
In a way, this is the same horizontal slicing as tensor parallelism, if you imagine the typical deep neural network (DNN) diagram. Vertical slicing is where one puts whole layer groups on different GPUs. But this is just the starting point.
From here, each GPU will get the usual mini-batch as it works in data parallelism (DP):
x0 => GPU0
x1 => GPU1
x2 => GPU2
First, the input data is applied to layer La.
Let's focus on GPU0: x0 needs the a0, a1 and a2 parameters to do its forward path, but GPU0 has only a0. It receives a1 from GPU1 and a2 from GPU2, bringing all the pieces of the model together.
In parallel, GPU1 gets the mini-batch x1; it only has a1, but needs the a0 and a2 parameters, which it gets from GPU0 and GPU2.
The same happens to GPU2, which receives x2. It gets a0 and a1 from GPU0 and GPU1, and with its a2 it reconstructs the full tensor.
All 3 GPUs reconstruct the full tensors and the forward pass happens.
As soon as the calculation is done, the data that is no longer needed gets dropped - it is only used during the calculation. The reconstruction is done efficiently via a pre-fetch.
Then the whole process is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La.
To me this sounds like an efficient group backpacking weight distribution strategy:
- Person A carries the tent.
- Person B carries the stove.
- Person C carries the axe.
Now each night they all share what they have with the others and get from the others what they don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is Sharded DDP / ZeRO-DP.
Compare this strategy to the simple one where each person would have to carry their own tent, stove and axe, which would be far less efficient. This is DataParallel (DP and DDP) in PyTorch.
While reading the literature on this topic you may encounter the following synonyms: Sharded, Partitioned.
If you pay close attention to the way ZeRO partitions the model's weights, it looks very similar to tensor parallelism, which will be discussed later. This is because it partitions/shards each layer's weights, unlike vertical model parallelism, which is discussed next.
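As a side note, PyTorch itself ships a ZeRO stage 1 style sharding of the optimizer states as `ZeroRedundancyOptimizer`, which can be dropped into an existing DDP script. A minimal sketch (assuming the process group is already initialized, e.g. via `torchrun`, and using a toy model as a placeholder):

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# toy model, already placed on this rank's GPU and wrapped in DDP
model = DDP(torch.nn.Linear(10, 10).cuda())

# each rank keeps only its shard of the AdamW states (optimizer sharding);
# step() updates the local shard and broadcasts the updated parameters
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=1e-4,
)
```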
Implementations:
- DeepSpeed ZeRO-DP stages 1+2+3
- `transformers` integration
Naive Model Parallelism (Vertical) and Pipeline Parallelism
Naive Model Parallelism (MP) is where one spreads groups of model layers across multiple GPUs. The mechanism is relatively simple: switch the desired layers to the desired devices with the `.to()` method, and now whenever the data goes in and out of those layers, the data is switched to the same device as the layer, while the rest remains untouched.
We refer to this as "naive" MP because, if you remember how most models are drawn, we slice the layers vertically. For example, the following diagram shows an 8-layer model:
=================== ===================
| 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 |
=================== ===================
gpu0 gpu1
We have just sliced the model vertically in two, placing layers 0-3 on GPU0 and layers 4-7 on GPU1.
While data travels from layer 0 to 1, 1 to 2 and 2 to 3, it is just like a normal model. But when data needs to pass from layer 3 to layer 4, it has to travel from GPU0 to GPU1, which introduces a communication overhead. If the participating GPUs are on the same compute node (e.g. the same physical machine) this copying is pretty fast, but if the GPUs are located on different compute nodes (e.g. multiple machines) the communication overhead could be significantly larger.
Then layers 4 to 5 to 6 to 7 work as in a normal model, and when the 7th layer completes, the data often needs to be sent back to layer 0 where the labels are (or, alternatively, the labels are sent to the last layer). Now the loss can be computed and the optimizer can do its work.
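To make the mechanism concrete, here is a minimal sketch of naive vertical MP on a toy model (two GPUs assumed; the layer count and sizes are arbitrary illustration):

```python
import torch
import torch.nn as nn

class TwoGpuModel(nn.Module):
    def __init__(self):
        super().__init__()
        # layers 0-3 live on GPU0, layers 4-7 on GPU1
        self.part1 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(4)]).to("cuda:0")
        self.part2 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(4)]).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # the only MP-specific step: copy the activations across devices
        return self.part2(x.to("cuda:1"))

model = TwoGpuModel()
loss = model(torch.randn(8, 512)).sum()  # the output lives on cuda:1
loss.backward()  # autograd routes the gradients back across the devices
```

Note that while one GPU computes, the other sits idle - which is exactly the deficiency discussed next.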
Problems:
- The main deficiency, and the reason this one is called "naive" MP, is that all but one GPU are idle at any given moment. So if 4 GPUs are used, it is almost identical to quadrupling the amount of memory of a single GPU and ignoring the rest of the hardware. Plus there is the overhead of copying the data between devices. So 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, except the latter will complete the training faster, since it doesn't have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model, you can with 4x 40GB cards (but barely, because of the gradient and optimizer states).
- Shared embeddings may need to be copied back and forth between the GPUs.
Pipeline Parallelism (PP) is almost identical to naive MP, but it solves the GPU idling problem by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process.
The following illustration from the GPipe paper shows naive MP at the top and PP at the bottom:
It is easy to see from the figure how PP has fewer dead zones, where GPUs are idle. The idle parts are referred to as the "bubble".
Both parts of the figure show a parallelism of degree 4, i.e. 4 GPUs participate in the pipeline. There are the 4 pipe stages F0, F1, F2 and F3 doing the forward path, and then the reverse-order backward path B3, B2, B1 and B0.
PP introduces a new hyperparameter to tune, called `chunks`, which defines how many chunks of data are sent in a sequence through the same pipe stage. For example, in the figure above you can see `chunks=4`. GPU0 performs the same forward path on chunks 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3), and then it waits for the other GPUs to start doing their work. Only then does GPU0 perform the backward path on chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).
Note that conceptually this is the same concept as gradient accumulation steps (GAS). PyTorch uses `chunks`, whereas DeepSpeed calls the same hyperparameter GAS.
Because of the `chunks`, PP introduces the concept of micro-batches (MBS). DP splits up the global data batch size into mini-batches. So if you have a DP degree of 4 and a global batch size of 1024, it gets split up into 4 mini-batches of 256 each (1024/4). And if the number of `chunks` (or GAS) is 32, we end up with a micro-batch size of 8 (256/32). Each pipeline stage works with a single micro-batch at a time.
To calculate the global batch size of a DP + PP setup, we do `mbs*chunks*dp_degree` (`8*32*4=1024`).
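For example, working through the numbers above in a few lines of code:

```python
dp_degree = 4             # number of data-parallel pipelines
global_batch_size = 1024
minibatch = global_batch_size // dp_degree  # 256 samples per pipeline
chunks = 32               # a.k.a. gradient accumulation steps (GAS)
micro_batch_size = minibatch // chunks      # 8 samples per pipe stage at a time
assert micro_batch_size * chunks * dp_degree == global_batch_size
```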
Let's go back to the figure.
With `chunks=1` you end up with the naive MP, which is inefficient. With a very large `chunks` value you end up with tiny micro-batch sizes, which is likely not very efficient either. So one has to experiment to find the value that leads to the most efficient utilization of the GPUs. This corresponds to minimizing the size of the bubble, since a small bubble means high concurrent GPU utilization across all participating GPUs.
There are two groups of solutions: the traditional Pipeline API solutions, and more modern solutions that make things much easier for the end user.
Traditional Pipeline API solutions:
- PyTorch
- DeepSpeed
- Megatron-LM
Modern solutions:
- Varuna
- Sagemaker
Problems with the traditional Pipeline API solutions:
- The model has to be modified quite extensively, because Pipeline requires the normal flow of modules to be rewritten into an `nn.Sequential` sequence of the same, which may require changes to the design of the model.
- The Pipeline API is currently very restricted. If you had a bunch of Python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since the pipeline is going to chunk the mini-batch into micro-batches. Possible improvements are being discussed here: https://github.com/pytorch/pytorch/pull/50693
- Conditional control flow at the level of pipe stages is not possible - e.g., encoder-decoder models like T5 require special workarounds to handle a conditional encoder stage.
- Each layer has to be arranged so that the output of one model becomes an input to the other model. A sketch of what this API looks like follows below.
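To give a feel for the `nn.Sequential` requirement, here is a minimal sketch based on PyTorch's experimental `Pipe` API (present in the 1.8-era releases this section discusses; the layer sizes and device count are arbitrary assumptions):

```python
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe is built on top of the RPC framework, which must be
# initialized even for a single-process run
rpc.init_rpc("worker", rank=0, world_size=1)

# the model must be expressed as an nn.Sequential, with each
# segment already placed on its own device
fc1 = nn.Linear(16, 8).to("cuda:0")
fc2 = nn.Linear(8, 4).to("cuda:1")
model = Pipe(nn.Sequential(fc1, fc2), chunks=4)  # 4 micro-batches per mini-batch

x = torch.rand(64, 16).to("cuda:0")  # the batch dimension must come first
output = model(x).local_value()      # forward returns an RRef
```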
We have not yet experimented with Varuna and SageMaker, but their papers report that they have overcome the list of problems mentioned above and that they require much smaller changes to the user's model.
Implementations:
- Pytorch (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). Some examples
- DeepSpeed
- Megatron-LM has an internal implementation - no API.
- Varuna
- SageMaker - this is a proprietary solution that can only be used on AWS.
- OSLO - this implementation is based on Hugging Face Transformers.
🀗 Transformers status: as of this writing none of the models supports full PP (pipeline parallelism). The GPT2 and T5 models have naive MP (model parallelism) support. The main obstacle is being unable to convert the models to `nn.Sequential` and have all the inputs be Tensors. The current models include many features that make the conversion very complicated, and these would need to be removed to accomplish it.
Other approaches:
DeepSpeed, Varuna and SageMaker use the concept of an interleaved pipeline, where the bubble (idle time) is further minimized by prioritizing backward passes.
Varuna further tries to improve the schedule by using simulations to discover the most efficient scheduling.
OSLO has a pipeline parallelism implementation based on Transformers, without the `nn.Sequential` conversion.
Tensor Parallelism
In tensor parallelism, each GPU processes only a slice of a tensor, and the full tensor is only aggregated for operations that require the whole thing.
In this section we use concepts and diagrams from the Megatron-LM paper: Efficient Large-Scale Language Model Training on GPU Clusters.
The main building block of any transformer is a fully connected `nn.Linear` followed by a nonlinear activation `GeLU`.
Following the Megatron paper's notation, the matrix multiplication part of it can be written as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix.
If we look at the computation in matrix form, it is easy to see how the matrix multiplication can be split between multiple GPUs:
If we split the weight matrix `A` column-wise across `N` GPUs and perform the matrix multiplications `XA_1` through `XA_n` in parallel, then we end up with `N` output vectors `Y_1, Y_2, ..., Y_n`, which can be fed into `GeLU` independently:
Using this principle, we can update an MLP of arbitrary depth, without the need for any synchronization between the GPUs until the very end. The Megatron-LM authors provide a helpful illustration for that:
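The key point - that `GeLU` can be applied to each shard independently because it is element-wise - can be verified with a small single-process sketch (random matrices standing in for the shards that would live on separate GPUs):

```python
import torch
import torch.nn.functional as F

X = torch.randn(4, 8)        # input, replicated on every rank
A = torch.randn(8, 6)        # full weight matrix
A1, A2 = A.chunk(2, dim=1)   # column-wise split; each shard would live on its own GPU

# each "GPU" computes its slice of Y = GeLU(XA) with no communication
Y1 = F.gelu(X @ A1)
Y2 = F.gelu(X @ A2)

# an all-gather along the column dimension reconstructs the full result
Y = torch.cat([Y1, Y2], dim=1)
assert torch.allclose(Y, F.gelu(X @ A), atol=1e-6)
```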
Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!
Special considerations: TP requires a very fast network, and therefore it is not advisable to do TP across more than one node. Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs.
This section is based on the original, much more detailed TP overview by @anton-l.
SageMaker combines TP with DP for a more efficient processing.
Alternative names:
- DeepSpeed calls it "tensor slicing". See more details in the DeepSpeed feature overview.
Implementations:
- Megatron-LM has an internal, model-specific implementation.
- parallelformers (only inference at the moment).
- SageMaker - this is a proprietary solution that can only be used on AWS.
- OSLO has a tensor parallelism implementation based on Transformers.
🀗 Transformers status:
- core: not yet implemented in the core.
- But if you need inference, parallelformers provides support for most of our models. So until this is implemented in the core, you can use theirs - and hopefully training mode will be supported too.
- Deepspeed-Inference supports our BERT, GPT-2 and GPT-Neo models in its super-fast CUDA-kernel-based inference mode. See more here.
DP+PP
The following diagram from the DeepSpeed pipeline tutorial demonstrates how one can combine DP with PP.
Here it is important to see how DP rank 0 doesn't see GPU2, and DP rank 1 doesn't see GPU3. To DP, there are just GPUs 0 and 1, which it feeds data to as if there were just 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP, and GPU1 does the same by enlisting GPU3 to its aid.
Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs.
Implementations:
🀗 Transformers status: not yet implemented.
DP+PP+TP
To get an even more efficient training, PP is combined with TP and DP, which is called 3D parallelism. This can be seen in the following diagram.
This diagram is from the blog post 3D parallelism: Scaling to trillion-parameter models, which is a good read as well.
Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs.
Implementations:
- DeepSpeed - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP.
- Megatron-LM
- Varuna
- SageMaker
- OSLO
🀗 Transformers status: not yet implemented, since we have no PP and TP.
ZeRO DP+PP+TP
One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been discussed in the ZeRO Data Parallelism section. Normally it is a standalone feature that doesn't require PP or TP, but it can be combined with PP and TP.
When ZeRO-DP is combined with PP (and optionally TP), it typically enables only ZeRO stage 1 (optimizer sharding).
While it is theoretically possible to use ZeRO stage 2 (gradient sharding) with pipeline parallelism, it will have bad performance impacts. There would need to be an additional reduce-scatter collective for every micro-batch to aggregate the gradients before sharding, which adds a potentially significant communication overhead. By the nature of pipeline parallelism, small micro-batches are used, and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with minimizing the pipeline bubble (number of micro-batches). Therefore those communication costs are going to hurt.
In addition, there are already fewer layers than normal due to PP, so the memory savings won't be huge. PP already reduces gradient size by 1/PP, so the gradient sharding savings on top of that are less significant than with pure DP.
ZeRO stage 3 is not a good choice either, for the same reason - more inter-node communications are required.
And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1, the optimizer states can be offloaded to CPU.
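As an illustration of what enabling this could look like, here is a hedged sketch of a DeepSpeed config dict for ZeRO stage 1 with optimizer-state offload to CPU (the batch size is a placeholder, and the exact keys may vary between DeepSpeed versions):

```python
# minimal sketch of a DeepSpeed config enabling ZeRO stage 1 + CPU offload
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder value
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "cpu"},
    },
}
# it would then be passed to the engine, e.g.:
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```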
Implementations:
- Megatron-DeepSpeed and Megatron-DeepSpeed from BigScience, which is a fork of the former repo.
- OSLO
Important papers:
🀗 Transformers status: not yet implemented, since we have no PP and TP.
FlexFlow
FlexFlow solves the parallelization problem in a slightly different way.
FlexFlow performs a sort of 4D parallelism over Sample-Operator-Attribute-Parameter:
- Sample = data parallelism (sample-wise parallelism)
- Operator = parallelizing a single operation into several sub-operations
- Attribute = data parallelism (length-wise parallelism)
- Parameter = model parallelism (regardless of dimension - horizontal or vertical)
Examples:
- Sample: let's take 10 batches of sequence length 512. If we parallelize them by the sample dimension into 2 devices, then 10 x 512 becomes 5 x 2 x 512 (see the sketch after this list).
- Operator: if we perform layer normalization, we compute std first and mean second, and then we can normalize the data. Operator parallelism allows computing std and mean in parallel. So if we parallelize them by the operator dimension into 2 devices (cuda:0, cuda:1), first we copy the input data into both devices, and cuda:0 computes std while cuda:1 computes mean at the same time.
- Attribute: we have 10 batches of length 512. If we parallelize them by the attribute dimension into 2 devices, 10 x 512 becomes 10 x 2 x 256 (also sketched below).
- Parameter: this is similar to tensor model parallelism or naive layer-wise model parallelism.
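The sample and attribute splits above can be visualized with a quick reshape (plain tensors standing in for the actual device placement):

```python
import torch

batch = torch.randn(10, 512)  # 10 samples of sequence length 512

# sample-dimension split across 2 devices: 10 x 512 -> 5 x 2 x 512
sample_split = batch.reshape(5, 2, 512)

# attribute-dimension (length-wise) split: 10 x 512 -> 10 x 2 x 256
attribute_split = batch.reshape(10, 2, 256)
```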
The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast intra-connect vs. slow inter-connect and automatically optimizes all of these, algorithmically deciding which parallelization to use where.
One very important aspect is that FlexFlow is designed for optimizing DNN parallelizations for models with static and fixed workloads, since models with dynamic behavior may prefer different parallelization strategies across iterations.
So the promise is very attractive: it runs, say, a 30-minute simulation on the cluster of choice, and it comes up with the best strategy to utilize this specific environment optimally. If you add/remove/replace any parts, it will run and re-optimize the plan for that, and then you can train. A different setup will have its own custom optimization.
🀗 Transformers status: not yet integrated. We already have our models FX-traceable via transformers.utils.fx, which is a prerequisite for FlexFlow, so someone needs to figure out what needs to be done to make FlexFlow work with our models.
Which Strategy To Use When
Here is a very rough outline of which parallelization strategy to use when. The first on each list is typically faster.
⇚ Single GPU
If the model fits onto a single GPU:
- Normal use
If the model doesn't fit onto a single GPU:
- ZeRO + Offload CPU and, optionally, NVMe
- as above, plus enabling Memory Centric Tiling (see below for details) if the largest layer cannot fit into a single GPU
If the largest layer doesn't fit onto a single GPU:
- If not using ZeRO - you must use TP, as PP alone won't be able to fit it.
- If using ZeRO - see the same entry under "Single GPU" above.
⇚ Single Node / Multi-GPU
If the model fits onto a single GPU:
- DDP - Distributed Data Parallel
- ZeRO - may or may not be faster depending on the situation and configuration used
If the model doesn't fit onto a single GPU:
- PP
- ZeRO
- TP
With very fast intra-node connectivity such as NVLINK or NVSwitch, all three should be mostly on par; without those, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It is best to experiment to find the winner on your particular setup.
TP is almost always used within a single node, i.e. TP size <= GPUs per node.
If the largest layer doesn't fit onto a single GPU:
- If not using ZeRO - you must use TP, as PP alone won't be able to fit it.
- If using ZeRO - see the same entry under "Single GPU" above.
⇚ Multi-Node / Multi-GPU
If you have fast inter-node connectivity:
- ZeRO - requires close to no modifications to the model
- PP+TP+DP - less communication, but requires massive changes to the model
If you have slow inter-node connectivity and are still low on GPU memory:
- DP+PP+TP+ZeRO-1