Resource Overcommit Task
I. Environment (Rocky 8.8 / openEuler 22.03, Slurm 23.02)
1. Node
| Nodename | N1 | N2 | N3 | N5 |
|---|---|---|---|---|
| Number of Sockets | 2 | 2 | 2 | 1 |
| Number of Cores per Socket | 4 | 4 | 4 | 4 |
| Total Number of Cores | 8 | 8 | 8 | 4 |
| Number of Threads (CPUs) per Core | 1 | 1 | 1 | 2 |
| Total Number of CPUs | 8 | 8 | 8 | 8 |
2. Partition
| PartitionName | Part001 | Part003 |
|---|---|---|
| Nodes | N1/N2/N3 | N5 |
| Default | YES | - |
II. Job Execution
1. Job Requirements
A job has 20 tasks and must run on a single node.
2. Task Distribution
| Nodename | N1 | N2 | N3 |
|---|---|---|---|
| Number of Allocated CPUs | 8 | 0 | 0 |
| Number of Tasks | 20 | 0 | 0 |
| Distribution of Tasks to Nodes, by Task ID | 0 - 19 | - | - |
3. Configuration Parameters
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
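For context, the node and partition tables above could correspond to a slurm.conf fragment along these lines (a sketch reconstructed from the tables; the exact node definitions in the test cluster may differ):

```
# Hypothetical slurm.conf fragment matching the tables above
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
NodeName=N1,N2,N3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1
NodeName=N5 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2
PartitionName=Part001 Nodes=N1,N2,N3 Default=YES
PartitionName=Part003 Nodes=N5
```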
4. Command
srun --nodes=1-1 --ntasks=20 --overcommit sleep 60
III. Logs
1. N1
[2024-02-07T16:57:20.596] launch task StepId=17.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:59280
[2024-02-07T16:57:20.596] task/affinity: lllp_distribution: JobId=17 auto binding off: mask_cpu,one_thread
[2024-02-07T16:58:20.964] [17.0] done with job
IV. Summary
From the log we can first confirm that the tasks were executed on a single node. Second, the `auto binding off` message shows that when multiple tasks run on a single CPU, binding is disabled and no cpu_mask exists, because at that point there are fewer CPUs than tasks.
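The overcommit arithmetic above can be sketched as a quick check (an illustration of the CPU/task ratio, not Slurm's actual placement logic):

```python
import math

cpus = 8    # total CPUs on N1 (from the node table)
tasks = 20  # requested with --ntasks=20

# Without --overcommit, 20 tasks could not fit on one 8-CPU node.
# With --overcommit, CPUs are shared, so at least one CPU must
# run ceil(tasks / cpus) tasks.
assert tasks > cpus  # more tasks than CPUs: binding is disabled
max_tasks_per_cpu = math.ceil(tasks / cpus)
print(max_tasks_per_cpu)  # 3
```

This is why the log reports `auto binding off`: with 20 tasks sharing 8 CPUs, a one-task-per-CPU mask cannot be constructed.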