Allocation with Minimal Resource Fragmentation
I. Environment (Rocky 8.8/openEuler 22.03, Slurm 23.02)
1. Node
| Nodename | N1 | N2 | N3 | N5 |
|---|---|---|---|---|
| Number of Sockets | 2 | 2 | 2 | 1 |
| Number of Cores per Socket | 4 | 4 | 4 | 4 |
| Total Number of Cores | 8 | 8 | 8 | 4 |
| Number of Threads (CPUs) per Core | 1 | 1 | 1 | 2 |
| Total Number of CPUs | 8 | 8 | 8 | 8 |
2. Partition
| PartitionName | Part001 | Part003 |
|---|---|---|
| Nodes | N1/N2/N3 | N5 |
| Default | YES | - |
II. Running the Job
1. Job requirements
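The job requires 12 CPUs (12 tasks and 1 CPU per task with no overcommitment). CPUs are allocated using the minimum number of nodes and the minimum number of sockets the job needs, so that fragmentation of allocated/unallocated CPUs in the cluster is kept as low as possible.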
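The minimum node and socket counts for this job follow from simple arithmetic on the node table. A minimal Python sketch, assuming the Part001 topology (8 CPUs per node, 4 CPUs per socket); the constant and variable names are illustrative and not part of Slurm:

```python
import math

# Topology of N1/N2/N3 taken from the node table (illustrative constants).
CPUS_PER_NODE = 8
CPUS_PER_SOCKET = 4

ntasks = 12         # matches --ntasks=12
cpus_per_task = 1   # one CPU per task, no overcommitment

cpus_needed = ntasks * cpus_per_task
min_nodes = math.ceil(cpus_needed / CPUS_PER_NODE)      # fewest nodes that can hold the job
min_sockets = math.ceil(cpus_needed / CPUS_PER_SOCKET)  # fewest sockets that can hold the job

print(f"CPUs needed: {cpus_needed}, min nodes: {min_nodes}, min sockets: {min_sockets}")
```

With these numbers the job cannot fit on fewer than two nodes or three sockets, which is the layout the test aims for.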
2. Task distribution
| Nodename | N1 | N1 | N2 | N2 | N3 | N3 |
|---|---|---|---|---|---|---|
| Socket id | 0 | 1 | 0 | 1 | 0 | 1 |
| Number of Allocated CPUs | 4 | 4 | 4 | 0 | 0 | 0 |
| Number of Tasks (per node) | 8 | | 4 | | 0 | |
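The distribution in the table can be reproduced with a toy block-fill model: sockets are filled completely, in order, before the next one is touched. This is only a sketch for illustration, not Slurm's actual node selection algorithm; the function `block_fill` and its parameters are hypothetical.

```python
def block_fill(ncpus, nodes, sockets_per_node=2, cpus_per_socket=4):
    """Toy model: fill sockets in order, node by node, until ncpus are placed.

    Returns {node_name: [allocated CPUs on socket 0, socket 1, ...]}.
    This is an illustration only, not the cons_tres implementation.
    """
    layout = {}
    remaining = ncpus
    for node in nodes:
        per_socket = []
        for _ in range(sockets_per_node):
            take = min(cpus_per_socket, remaining)
            per_socket.append(take)
            remaining -= take
        layout[node] = per_socket
    return layout


layout = block_fill(12, ["N1", "N2", "N3"])
for node, per_socket in layout.items():
    # With 1 CPU per task, tasks per node equals allocated CPUs per node.
    print(node, "CPUs per socket:", per_socket, "tasks:", sum(per_socket))
```

Under this model N1 gets both sockets filled, N2 gets one socket filled, and N3 stays entirely free, so no socket is left partially allocated.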
3. Parameter configuration
SelectType=select/cons_tres
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK
4. Command
srun --ntasks=12 sleep 60
III. Logs
1. N1
[2024-02-07T17:58:18.063] launch task StepId=25.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:49022
[2024-02-07T17:58:18.064] task/affinity: lllp_distribution: JobId=25 implicit auto binding: cores,one_thread, dist 8192
[2024-02-07T17:58:18.064] task/affinity: _task_layout_lllp_block: _task_layout_lllp_block
[2024-02-07T17:58:18.064] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80
[2024-02-07T17:59:18.301] [25.0] done with job
2. N2
[2024-02-07T17:58:16.690] launch task StepId=25.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:57456
[2024-02-07T17:58:16.690] task/affinity: lllp_distribution: JobId=25 implicit auto binding: cores,one_thread, dist 8192
[2024-02-07T17:58:16.690] task/affinity: _task_layout_lllp_block: _task_layout_lllp_block
[2024-02-07T17:58:16.690] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x01,0x02,0x04,0x08
[2024-02-07T17:59:16.886] [25.0] done with job
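IV. Summary
The logs first confirm that the tasks ran on two nodes, N1 and N2. Second, the CPU masks in the logs (0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80 on N1 and 0x01,0x02,0x04,0x08 on N2) show where the allocated CPUs sit on each node: eight tasks bound to eight distinct CPUs on N1 and four tasks bound to four distinct CPUs on N2.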
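The masks can be decoded mechanically: each comma-separated value is one task's binding, and each set bit is one CPU id. The sketch below is a hypothetical helper, not a Slurm tool. It assumes CPU ids are numbered socket by socket (ids 0-3 on socket 0, ids 4-7 on socket 1); the real id-to-socket mapping depends on the hardware and should be checked with a tool such as `lscpu`.

```python
def decode_masks(mask_list, cpus_per_socket=4):
    """Map each task's hex CPU mask to a list of (cpu_id, socket_id) pairs.

    Assumption: CPU ids are numbered socket by socket, so
    socket_id = cpu_id // cpus_per_socket. Verify this on real hardware.
    """
    placements = []
    for mask_str in mask_list.split(","):
        mask = int(mask_str, 16)  # int() accepts the "0x" prefix with base 16
        cpus = [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]
        placements.append([(cpu, cpu // cpus_per_socket) for cpu in cpus])
    return placements


# Mask lists copied from the _lllp_generate_cpu_bind log lines.
n1 = decode_masks("0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80")
n2 = decode_masks("0x01,0x02,0x04,0x08")

for name, tasks in (("N1", n1), ("N2", n2)):
    for local_task, cpus in enumerate(tasks):
        print(name, "local task", local_task, "-> (cpu, socket):", cpus)
```

Under the stated numbering assumption, the N1 masks cover CPUs 0-7 across both sockets and the N2 masks cover CPUs 0-3 on a single socket, which matches the task distribution table.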