Balanced resource allocation across nodes
I. Environment (Rocky 8.8 / openEuler 22.03, Slurm 23.02)
1. Node
| Nodename | N1 | N2 | N3 | N5 |
| --- | --- | --- | --- | --- |
| Number of Sockets | 2 | 2 | 2 | 1 |
| Number of Cores per Socket | 4 | 4 | 4 | 4 |
| Total Number of Cores | 8 | 8 | 8 | 4 |
| Number of Threads (CPUs) per Core | 1 | 1 | 1 | 2 |
| Total Number of CPUs | 8 | 8 | 8 | 8 |
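The totals in the node table follow from the per-node topology: cores = sockets × cores per socket, and CPUs = cores × threads per core. The short Python sketch below re-derives them as a sanity check. The `Node` class and its field names are illustrative only and are not part of Slurm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Illustrative container for the topology values in the node table."""
    name: str
    sockets: int
    cores_per_socket: int
    threads_per_core: int

    @property
    def total_cores(self) -> int:
        # Physical cores on the node.
        return self.sockets * self.cores_per_socket

    @property
    def total_cpus(self) -> int:
        # Slurm counts each hardware thread as one CPU.
        return self.total_cores * self.threads_per_core


# Values copied from the node table.
NODES = [
    Node("N1", sockets=2, cores_per_socket=4, threads_per_core=1),
    Node("N2", sockets=2, cores_per_socket=4, threads_per_core=1),
    Node("N3", sockets=2, cores_per_socket=4, threads_per_core=1),
    Node("N5", sockets=1, cores_per_socket=4, threads_per_core=2),
]

for node in NODES:
    print(f"{node.name}: cores={node.total_cores} cpus={node.total_cpus}")
```

Note that N5 reaches the same 8 CPUs as the other nodes with only 4 cores, because each of its cores exposes 2 threads.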
2. Partition
| PartitionName | Part001 | Part003 |
| --- | --- | --- |
| Nodes | N1/N2/N3 | N5 |
| Default | YES | - |
II. Running the job
1. Job requirements
The job needs 9 CPUs (3 tasks with 3 CPUs per task, no overcommitment). Each of the 3 nodes in the default partition should contribute 3 CPUs.
2. Task distribution
| Nodename | N1 | N2 | N3 |
| --- | --- | --- | --- |
| Number of Allocated CPUs | 3 | 3 | 3 |
| Number of Tasks | 1 | 1 | 1 |
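The expected layout is plain arithmetic: 3 tasks spread over 3 nodes gives 1 task per node, and each task carries 3 CPUs. The Python sketch below reproduces that arithmetic. It is not Slurm's placement algorithm; `spread_tasks` is a hypothetical helper that only models an even split.

```python
def spread_tasks(ntasks: int, cpus_per_task: int, nodes: list[str]) -> dict[str, dict[str, int]]:
    """Split ntasks as evenly as possible over nodes (illustrative only)."""
    base, extra = divmod(ntasks, len(nodes))
    layout = {}
    for index, name in enumerate(nodes):
        # The first `extra` nodes take one additional task when the split is uneven.
        tasks = base + (1 if index < extra else 0)
        layout[name] = {"tasks": tasks, "cpus": tasks * cpus_per_task}
    return layout


layout = spread_tasks(ntasks=3, cpus_per_task=3, nodes=["N1", "N2", "N3"])
for name, counts in layout.items():
    print(f"{name}: tasks={counts['tasks']} cpus={counts['cpus']}")

total_cpus = sum(counts["cpus"] for counts in layout.values())
print(f"total CPUs: {total_cpus}")
```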
3. Configuration parameters
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
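With CR_Core, whole cores are the unit of allocation, so the CPU count given to a job is rounded up to cover every CPU on each allocated core. On N1, N2, and N3 (1 thread per core), a request for 3 CPUs maps to exactly 3 cores. The Python sketch below illustrates that rounding. `cr_core_allocation` is a hypothetical helper for the arithmetic only, and the N5 figure is a derived illustration rather than a measured result.

```python
import math


def cr_core_allocation(cpus_requested: int, threads_per_core: int) -> tuple[int, int]:
    """Return (cores, cpus) allocated when whole cores are the allocation unit."""
    cores = math.ceil(cpus_requested / threads_per_core)
    # Every thread on an allocated core is counted as allocated.
    return cores, cores * threads_per_core


# N1/N2/N3: 1 thread per core, so 3 CPUs means 3 cores and no rounding.
print(cr_core_allocation(cpus_requested=3, threads_per_core=1))

# N5 (not used by this job): 2 threads per core, so 3 CPUs rounds up to 2 cores.
print(cr_core_allocation(cpus_requested=3, threads_per_core=2))
```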
4. Command
srun --nodes=3-3 --ntasks=3 --cpus-per-task=3 sleep 60
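In this command, `--nodes=3-3` sets both the minimum and the maximum node count to 3, which forces the 3 tasks onto 3 separate nodes. The Python sketch below assembles the same command line from the job requirements. `build_srun_argv` is a hypothetical helper, and the sketch only prints the command rather than running it.

```python
import shlex


def build_srun_argv(min_nodes: int, max_nodes: int, ntasks: int,
                    cpus_per_task: int, command: list[str]) -> list[str]:
    """Assemble an srun argument list from the job requirements."""
    return [
        "srun",
        f"--nodes={min_nodes}-{max_nodes}",  # min-max node range
        f"--ntasks={ntasks}",
        f"--cpus-per-task={cpus_per_task}",
        *command,
    ]


argv = build_srun_argv(min_nodes=3, max_nodes=3, ntasks=3,
                       cpus_per_task=3, command=["sleep", "60"])
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run(argv)` on a host where `srun` is available.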
III. Logs
1. N1
[2024-02-07T17:46:13.730] launch task StepId=23.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:56210
[2024-02-07T17:46:13.730] task/affinity: lllp_distribution: JobId=23 implicit auto binding: cores,one_thread, dist 8192
[2024-02-07T17:46:13.731] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2024-02-07T17:46:13.731] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread, 0x07
[2024-02-07T17:47:13.825] [23.0] done with job
2. N2
[2024-02-07T17:46:12.357] launch task StepId=23.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:57970
[2024-02-07T17:46:12.357] task/affinity: lllp_distribution: JobId=23 implicit auto binding: cores,one_thread, dist 8192
[2024-02-07T17:46:12.357] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2024-02-07T17:46:12.357] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread, 0x07
[2024-02-07T17:47:12.450] [23.0] done with job
3. N3
[2024-02-07T17:46:11.580] launch task StepId=23.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:50374
[2024-02-07T17:46:11.580] task/affinity: lllp_distribution: JobId=23 implicit auto binding: cores,one_thread, dist 8192
[2024-02-07T17:46:11.580] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2024-02-07T17:46:11.580] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread, 0x07
[2024-02-07T17:47:11.671] [23.0] done with job
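The line of interest in each node's log is the `_lllp_generate_cpu_bind` entry, which ends with the CPU mask. The Python sketch below extracts the job ID and mask from slurmd log text. The regular expression is an assumption based only on the lines shown here and may need adjusting for other Slurm versions or log levels. It tolerates the mask appearing on the same line or on the following one.

```python
import re

# Pattern inferred from the excerpts above; not an official log format.
MASK_RE = re.compile(
    r"_lllp_generate_cpu_bind jobid \[(\d+)\]:\s*mask_cpu,[^,]*,\s*(0x[0-9A-Fa-f]+)"
)


def extract_masks(log_text: str) -> list[tuple[int, str]]:
    """Return (job_id, mask) pairs found in slurmd log text."""
    return [(int(job_id), mask) for job_id, mask in MASK_RE.findall(log_text)]


# Sample taken from the N1 log above.
SAMPLE = (
    "[2024-02-07T17:46:13.731] task/affinity: _lllp_generate_cpu_bind: "
    "_lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread,\n"
    "0x07\n"
    "[2024-02-07T17:47:13.825] [23.0] done with job\n"
)

masks = extract_masks(SAMPLE)
print(masks)
```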
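IV. Summary
The logs confirm two things. First, the job step was launched on all three nodes (N1, N2, and N3). Second, the CPU mask in each node's log (0x07, binary 111) has three bits set, which shows that three CPUs on each node were allocated to the job.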
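Reading the mask comes down to counting set bits. The Python sketch below decodes a hexadecimal CPU mask into CPU indices, assuming that bit i of the mask corresponds to CPU index i on that node. `cpus_in_mask` is a hypothetical helper name.

```python
def cpus_in_mask(mask: str) -> list[int]:
    """Return the indices of the bits set in a hexadecimal CPU mask."""
    value = int(mask, 16)  # base 16 accepts the "0x" prefix
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


cpus = cpus_in_mask("0x07")
print(cpus)       # CPU indices selected by the mask
print(len(cpus))  # number of CPUs bound on the node
```

For 0x07 this gives indices 0, 1, and 2, which is three CPUs per node and nine across the three nodes.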