πŸ•ΈοΈ Parallelism Support

Examples from Aurora

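The examples below run ezpz.test_dist on 24 ranks with different --tp, --cp, and --pp values. In each case the data-parallel (DP) size is the remaining factor, 24 / (TP × CP × PP).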

  • TP = 1, CP = 4, PP = 2, DP = 3
$ launch python3 -Wignore -m ezpz.test_dist --tp 1 --cp 4 --pp 2
# ...clipped...
[2024-12-31 15:36:13.333215][INFO][__init__.py:146] - > initializing model parallel with size 1
[2024-12-31 15:36:13.333942][INFO][__init__.py:151] - > initializing context parallel with size 4
[2024-12-31 15:36:13.334476][INFO][__init__.py:156] - > initializing pipeline with size 2
[2024-12-31 15:36:13.334971][INFO][__init__.py:159] - > initializing ddp with size 3
[2024-12-31 15:36:14.402809][INFO][dist.py:846] - [ 0/23]: [cp:0/3][pp:0/1][dp:0/2]
[2024-12-31 15:36:14.402209][INFO][dist.py:846] - [ 3/23]: [cp:3/3][pp:0/1][dp:0/2]
[2024-12-31 15:36:14.402211][INFO][dist.py:846] - [ 1/23]: [cp:1/3][pp:0/1][dp:0/2]
[2024-12-31 15:36:14.402197][INFO][dist.py:846] - [ 7/23]: [cp:3/3][pp:1/1][dp:0/2]
[2024-12-31 15:36:14.402239][INFO][dist.py:846] - [ 4/23]: [cp:0/3][pp:1/1][dp:0/2]
[2024-12-31 15:36:14.402224][INFO][dist.py:846] - [ 5/23]: [cp:1/3][pp:1/1][dp:0/2]
[2024-12-31 15:36:14.402200][INFO][dist.py:846] - [ 9/23]: [cp:1/3][pp:0/1][dp:1/2]
[2024-12-31 15:36:14.402223][INFO][dist.py:846] - [10/23]: [cp:2/3][pp:0/1][dp:1/2]
[2024-12-31 15:36:14.402229][INFO][dist.py:846] - [ 2/23]: [cp:2/3][pp:0/1][dp:0/2]
[2024-12-31 15:36:14.402216][INFO][dist.py:846] - [ 8/23]: [cp:0/3][pp:0/1][dp:1/2]
[2024-12-31 15:36:14.402206][INFO][dist.py:846] - [11/23]: [cp:3/3][pp:0/1][dp:1/2]
[2024-12-31 15:36:14.402208][INFO][dist.py:846] - [21/23]: [cp:1/3][pp:1/1][dp:2/2]
[2024-12-31 15:36:14.402256][INFO][dist.py:846] - [12/23]: [cp:0/3][pp:1/1][dp:1/2]
[2024-12-31 15:36:14.402312][INFO][dist.py:846] - [ 6/23]: [cp:2/3][pp:1/1][dp:0/2]
[2024-12-31 15:36:14.402233][INFO][dist.py:846] - [13/23]: [cp:1/3][pp:1/1][dp:1/2]
[2024-12-31 15:36:14.402255][INFO][dist.py:846] - [14/23]: [cp:2/3][pp:1/1][dp:1/2]
[2024-12-31 15:36:14.402234][INFO][dist.py:846] - [15/23]: [cp:3/3][pp:1/1][dp:1/2]
[2024-12-31 15:36:14.402252][INFO][dist.py:846] - [16/23]: [cp:0/3][pp:0/1][dp:2/2]
[2024-12-31 15:36:14.402235][INFO][dist.py:846] - [17/23]: [cp:1/3][pp:0/1][dp:2/2]
[2024-12-31 15:36:14.402209][INFO][dist.py:846] - [19/23]: [cp:3/3][pp:0/1][dp:2/2]
[2024-12-31 15:36:14.402218][INFO][dist.py:846] - [20/23]: [cp:0/3][pp:1/1][dp:2/2]
[2024-12-31 15:36:14.402243][INFO][dist.py:846] - [22/23]: [cp:2/3][pp:1/1][dp:2/2]
[2024-12-31 15:36:14.402211][INFO][dist.py:846] - [23/23]: [cp:3/3][pp:1/1][dp:2/2]
[2024-12-31 15:36:14.402291][INFO][dist.py:846] - [18/23]: [cp:2/3][pp:0/1][dp:2/2]
  • TP = CP = PP = 2, DP = 3
$ launch python3 -Wignore -m ezpz.test_dist --tp 2 --cp 2 --pp 2 #--dtype=float32
# ...clipped...
[2024-12-31 15:19:37.033562][INFO][__init__.py:146] - > initializing model parallel with size 2
[2024-12-31 15:19:37.034083][INFO][__init__.py:151] - > initializing context parallel with size 2
[2024-12-31 15:19:37.034451][INFO][__init__.py:156] - > initializing pipeline with size 2
[2024-12-31 15:19:37.034792][INFO][__init__.py:159] - > initializing ddp with size 3
# ...clipped...
[2024-12-31 15:19:38.239822][INFO][dist.py:824] - Using device='xpu' with backend='DDP' + 'ccl' for distributed training.
[2024-12-31 15:19:38.240412][INFO][dist.py:846] - [ 0/23]: [cp:0/1][pp:0/1][tp:0/1][dp:0/2]
[2024-12-31 15:19:38.239840][INFO][dist.py:846] - [ 1/23]: [cp:0/1][pp:0/1][tp:1/1][dp:0/2]
[2024-12-31 15:19:38.239826][INFO][dist.py:846] - [ 2/23]: [cp:1/1][pp:0/1][tp:0/1][dp:0/2]
[2024-12-31 15:19:38.239838][INFO][dist.py:846] - [ 4/23]: [cp:0/1][pp:1/1][tp:0/1][dp:0/2]
[2024-12-31 15:19:38.239820][INFO][dist.py:846] - [ 6/23]: [cp:1/1][pp:1/1][tp:0/1][dp:0/2]
[2024-12-31 15:19:38.239835][INFO][dist.py:846] - [ 7/23]: [cp:1/1][pp:1/1][tp:1/1][dp:0/2]
[2024-12-31 15:19:38.239822][INFO][dist.py:846] - [ 8/23]: [cp:0/1][pp:0/1][tp:0/1][dp:1/2]
[2024-12-31 15:19:38.239818][INFO][dist.py:846] - [11/23]: [cp:1/1][pp:0/1][tp:1/1][dp:1/2]
[2024-12-31 15:19:38.239836][INFO][dist.py:846] - [ 3/23]: [cp:1/1][pp:0/1][tp:1/1][dp:0/2]
[2024-12-31 15:19:38.239845][INFO][dist.py:846] - [ 5/23]: [cp:0/1][pp:1/1][tp:1/1][dp:0/2]
[2024-12-31 15:19:38.239829][INFO][dist.py:846] - [ 9/23]: [cp:0/1][pp:0/1][tp:1/1][dp:1/2]
[2024-12-31 15:19:38.239822][INFO][dist.py:846] - [10/23]: [cp:1/1][pp:0/1][tp:0/1][dp:1/2]
[2024-12-31 15:19:38.239831][INFO][dist.py:846] - [12/23]: [cp:0/1][pp:1/1][tp:0/1][dp:1/2]
[2024-12-31 15:19:38.239814][INFO][dist.py:846] - [18/23]: [cp:1/1][pp:0/1][tp:0/1][dp:2/2]
[2024-12-31 15:19:38.239816][INFO][dist.py:846] - [20/23]: [cp:0/1][pp:1/1][tp:0/1][dp:2/2]
[2024-12-31 15:19:38.239827][INFO][dist.py:846] - [23/23]: [cp:1/1][pp:1/1][tp:1/1][dp:2/2]
[2024-12-31 15:19:38.239831][INFO][dist.py:846] - [13/23]: [cp:0/1][pp:1/1][tp:1/1][dp:1/2]
[2024-12-31 15:19:38.239826][INFO][dist.py:846] - [14/23]: [cp:1/1][pp:1/1][tp:0/1][dp:1/2]
[2024-12-31 15:19:38.239856][INFO][dist.py:846] - [15/23]: [cp:1/1][pp:1/1][tp:1/1][dp:1/2]
[2024-12-31 15:19:38.239848][INFO][dist.py:846] - [16/23]: [cp:0/1][pp:0/1][tp:0/1][dp:2/2]
[2024-12-31 15:19:38.239849][INFO][dist.py:846] - [17/23]: [cp:0/1][pp:0/1][tp:1/1][dp:2/2]
[2024-12-31 15:19:38.239814][INFO][dist.py:846] - [19/23]: [cp:1/1][pp:0/1][tp:1/1][dp:2/2]
[2024-12-31 15:19:38.239812][INFO][dist.py:846] - [21/23]: [cp:0/1][pp:1/1][tp:1/1][dp:2/2]
[2024-12-31 15:19:38.239817][INFO][dist.py:846] - [22/23]: [cp:1/1][pp:1/1][tp:0/1][dp:2/2]
  • TP = CP = 2, PP = 1, DP = 6
$ launch python3 -Wignore -m ezpz.test_dist --tp 2 --cp 2
# ...clipped...
[2024-12-31 15:29:21.697491][INFO][__init__.py:146] - > initializing model parallel with size 2
[2024-12-31 15:29:21.698012][INFO][__init__.py:151] - > initializing context parallel with size 2
[2024-12-31 15:29:21.698377][INFO][__init__.py:156] - > initializing pipeline with size 1
[2024-12-31 15:29:21.698745][INFO][__init__.py:159] - > initializing ddp with size 6
# ...clipped...
[2024-12-31 15:29:22.900343][INFO][dist.py:846] - [ 0/23]: [cp:0/1][tp:0/1][dp:0/5]
[2024-12-31 15:29:22.899759][INFO][dist.py:846] - [ 2/23]: [cp:1/1][tp:0/1][dp:0/5]
[2024-12-31 15:29:22.899758][INFO][dist.py:846] - [ 1/23]: [cp:0/1][tp:1/1][dp:0/5]
[2024-12-31 15:29:22.899760][INFO][dist.py:846] - [ 5/23]: [cp:0/1][tp:1/1][dp:1/5]
[2024-12-31 15:29:22.899758][INFO][dist.py:846] - [ 6/23]: [cp:1/1][tp:0/1][dp:1/5]
[2024-12-31 15:29:22.899745][INFO][dist.py:846] - [ 7/23]: [cp:1/1][tp:1/1][dp:1/5]
[2024-12-31 15:29:22.899740][INFO][dist.py:846] - [ 8/23]: [cp:0/1][tp:0/1][dp:2/5]
[2024-12-31 15:29:22.899743][INFO][dist.py:846] - [ 9/23]: [cp:0/1][tp:1/1][dp:2/5]
[2024-12-31 15:29:22.899741][INFO][dist.py:846] - [10/23]: [cp:1/1][tp:0/1][dp:2/5]
[2024-12-31 15:29:22.899741][INFO][dist.py:846] - [11/23]: [cp:1/1][tp:1/1][dp:2/5]
[2024-12-31 15:29:22.899759][INFO][dist.py:846] - [ 3/23]: [cp:1/1][tp:1/1][dp:0/5]
[2024-12-31 15:29:22.899760][INFO][dist.py:846] - [ 4/23]: [cp:0/1][tp:0/1][dp:1/5]
[2024-12-31 15:29:22.899756][INFO][dist.py:846] - [19/23]: [cp:1/1][tp:1/1][dp:4/5]
[2024-12-31 15:29:22.899760][INFO][dist.py:846] - [21/23]: [cp:0/1][tp:1/1][dp:5/5]
[2024-12-31 15:29:22.899777][INFO][dist.py:846] - [12/23]: [cp:0/1][tp:0/1][dp:3/5]
[2024-12-31 15:29:22.899775][INFO][dist.py:846] - [13/23]: [cp:0/1][tp:1/1][dp:3/5]
[2024-12-31 15:29:22.899787][INFO][dist.py:846] - [14/23]: [cp:1/1][tp:0/1][dp:3/5]
[2024-12-31 15:29:22.899791][INFO][dist.py:846] - [15/23]: [cp:1/1][tp:1/1][dp:3/5]
[2024-12-31 15:29:22.899781][INFO][dist.py:846] - [16/23]: [cp:0/1][tp:0/1][dp:4/5]
[2024-12-31 15:29:22.899782][INFO][dist.py:846] - [17/23]: [cp:0/1][tp:1/1][dp:4/5]
[2024-12-31 15:29:22.899798][INFO][dist.py:846] - [18/23]: [cp:1/1][tp:0/1][dp:4/5]
[2024-12-31 15:29:22.899755][INFO][dist.py:846] - [20/23]: [cp:0/1][tp:0/1][dp:5/5]
[2024-12-31 15:29:22.899758][INFO][dist.py:846] - [22/23]: [cp:1/1][tp:0/1][dp:5/5]
[2024-12-31 15:29:22.899758][INFO][dist.py:846] - [23/23]: [cp:1/1][tp:1/1][dp:5/5]
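The per-rank lines in the three runs above are consistent with a simple layout in which the TP index varies fastest, followed by CP, then PP, then DP. The sketch below reproduces that mapping in plain Python. It is inferred from the log output only and is not taken from the ezpz source; the function name `rank_to_coords` and its signature are hypothetical.

```python
def rank_to_coords(
    rank: int, tp: int, cp: int, pp: int, world_size: int = 24
) -> dict[str, int]:
    """Decompose a global rank into (dp, pp, cp, tp) indices.

    Assumption (inferred from the logs, not from ezpz itself):
    tp varies fastest, then cp, then pp, then dp.
    """
    group = tp * cp * pp  # ranks that make up one model replica
    if world_size % group != 0:
        raise ValueError(f"{world_size=} is not divisible by tp*cp*pp={group}")
    if not 0 <= rank < world_size:
        raise ValueError(f"{rank=} is outside [0, {world_size})")
    dp_idx, rem = divmod(rank, group)      # which replica
    pp_idx, rem = divmod(rem, cp * tp)     # which pipeline stage
    cp_idx, tp_idx = divmod(rem, tp)       # context / tensor index
    return {"dp": dp_idx, "pp": pp_idx, "cp": cp_idx, "tp": tp_idx}


# Rank 21 in the first run (--tp 1 --cp 4 --pp 2) is logged as
# [cp:1/3][pp:1/1][dp:2/2]:
print(rank_to_coords(21, tp=1, cp=4, pp=2))
# {'dp': 2, 'pp': 1, 'cp': 1, 'tp': 0}
```

Because each rank logs independently, the lines arrive out of rank order; a mapping like this makes it easier to check a given line without sorting the output first.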
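| World Size | TP | CP | PP | DP |
|-----------:|---:|---:|---:|---:|
|         24 |  1 |  1 |  1 | 24 |
|         24 |  2 |  1 |  1 | 12 |
|         24 |  1 |  2 |  1 | 12 |
|         24 |  1 |  1 |  2 | 12 |
|         24 |  2 |  2 |  1 |  6 |
|         24 |  2 |  1 |  2 |  6 |
|         24 |  1 |  2 |  2 |  6 |
|         24 |  4 |  1 |  1 |  6 |
|         24 |  1 |  4 |  1 |  6 |
|         24 |  1 |  1 |  4 |  6 |
|         24 |  4 |  2 |  1 |  3 |
|         24 |  4 |  1 |  2 |  3 |
|         24 |  2 |  4 |  1 |  3 |
|         24 |  2 |  1 |  4 |  3 |
|         24 |  1 |  4 |  2 |  3 |
|         24 |  1 |  2 |  4 |  3 |
|         24 |  4 |  2 |  2 |  3 |
|         24 |  2 |  4 |  2 |  3 |
|         24 |  2 |  2 |  2 |  3 |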
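A table like this can be generated rather than written by hand. The sketch below enumerates (TP, CP, PP) combinations for a given world size and derives DP as the remaining factor. It assumes DP = world size / (TP × CP × PP) and keeps only combinations where that division is exact, so a combination such as 4 × 2 × 2 = 16 on 24 ranks is skipped. The helper name `valid_layouts` and the candidate sizes (1, 2, 4) are illustrative choices, not part of ezpz.

```python
from itertools import product


def valid_layouts(
    world_size: int, sizes: tuple[int, ...] = (1, 2, 4)
) -> list[tuple[int, int, int, int]]:
    """Return (tp, cp, pp, dp) tuples whose product equals world_size.

    Assumption: dp is whatever remains after tp, cp and pp are chosen,
    so tp * cp * pp must divide world_size evenly.
    """
    layouts = []
    for tp, cp, pp in product(sizes, repeat=3):
        model_parallel = tp * cp * pp
        if world_size % model_parallel == 0:
            layouts.append((tp, cp, pp, world_size // model_parallel))
    return layouts


if __name__ == "__main__":
    print("World Size  TP  CP  PP  DP")
    for tp, cp, pp, dp in valid_layouts(24):
        print(f"{24:>10}  {tp:>2}  {cp:>2}  {pp:>2}  {dp:>2}")
```

Filtering on exact divisibility up front avoids launching a job whose parallel groups cannot be formed from the available ranks.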