TOWAYINFO

Balanced Resource Allocation Across Nodes

I. Environment (Rocky 8.8 / openEuler 22.03, Slurm 23.02)

   1. Node

   Nodename                             N1   N2   N3   N5
   Number of Sockets                    2    2    2    1
   Number of Cores per Socket           4    4    4    4
   Total Number of Cores                8    8    8    4
   Number of Threads (CPUs) per Core    1    1    1    2
   Total Number of CPUs                 8    8    8    8
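The totals in the node table follow from sockets × cores per socket × threads per core. A minimal Python sketch of that arithmetic is shown here; the `Node` class and its field names are illustrative helpers, not part of Slurm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hardware layout of one compute node (illustrative helper, not a Slurm API)."""
    name: str
    sockets: int
    cores_per_socket: int
    threads_per_core: int

    @property
    def total_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def total_cpus(self) -> int:
        # Following the table, each hardware thread is counted as one CPU.
        return self.total_cores * self.threads_per_core


# Values copied from the node table above.
NODES = [
    Node("N1", 2, 4, 1),
    Node("N2", 2, 4, 1),
    Node("N3", 2, 4, 1),
    Node("N5", 1, 4, 2),
]

for node in NODES:
    print(node.name, node.total_cores, node.total_cpus)
```

N5 reaches 8 CPUs with only 4 cores because it exposes 2 threads per core.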

   2. Partition

   PartitionName    Part001     Part003
   Nodes            N1/N2/N3    N5
   Default          YES         -

 

II. Running the Job

    1. Job requirements

      The job needs 9 CPUs (3 tasks and 3 CPUs per task with no overcommitment), with 3 CPUs allocated on each of the 3 nodes in the default partition.
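The requirement reduces to simple arithmetic: 3 tasks × 3 CPUs per task = 9 CPUs, spread evenly over 3 nodes. A small Python sketch of that check follows; the function name `cpus_per_node` is hypothetical and only illustrates the calculation.

```python
def cpus_per_node(ntasks: int, cpus_per_task: int, nnodes: int) -> int:
    """Return the CPUs needed on each node for an even spread of tasks.

    Raises ValueError if the tasks cannot be divided evenly across the nodes.
    """
    if ntasks % nnodes != 0:
        raise ValueError("tasks do not divide evenly across nodes")
    tasks_per_node = ntasks // nnodes
    return tasks_per_node * cpus_per_task


total_cpus = 3 * 3  # 3 tasks x 3 CPUs per task
per_node = cpus_per_node(ntasks=3, cpus_per_task=3, nnodes=3)
print(total_cpus, per_node)  # prints: 9 3
```

Each node in the default partition has 8 CPUs, so 3 CPUs per node fits without overcommitment.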

    2. Task distribution

    Nodename                    N1   N2   N3
    Number of Allocated CPUs    3    3    3
    Number of Tasks             1    1    1

    3. Configuration parameters

        SelectType=select/cons_tres

        SelectTypeParameters=CR_Core

    4. Command

       srun --nodes=3-3 --ntasks=3 --cpus-per-task=3 sleep 60
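In the command above, `--nodes=3-3` is a min-max range, so equal bounds request exactly 3 nodes. For scripted submissions, the same command line could be assembled as in the sketch below; `build_srun_command` is a hypothetical helper, and only the `srun` flags themselves come from the command above.

```python
import shlex
from typing import List


def build_srun_command(nodes: int, ntasks: int, cpus_per_task: int,
                       program: List[str]) -> List[str]:
    """Build an srun argument list that requests exactly `nodes` nodes."""
    return [
        "srun",
        f"--nodes={nodes}-{nodes}",  # min-max range; equal bounds mean exactly this many
        f"--ntasks={ntasks}",
        f"--cpus-per-task={cpus_per_task}",
        *program,
    ]


cmd = build_srun_command(3, 3, 3, ["sleep", "60"])
print(shlex.join(cmd))
# prints: srun --nodes=3-3 --ntasks=3 --cpus-per-task=3 sleep 60
```

On a host with the Slurm client tools installed, the list could be passed to `subprocess.run(cmd, check=True)`.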

 

 III. Logs

     1. N1    

[2024-02-07T17:46:13.730] launch task StepId=23.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:56210

[2024-02-07T17:46:13.730] task/affinity: lllp_distribution: JobId=23 implicit auto binding: cores,one_thread, dist 8192

[2024-02-07T17:46:13.731] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic

[2024-02-07T17:46:13.731] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread, 0x07

[2024-02-07T17:47:13.825] [23.0] done with job

    2. N2

[2024-02-07T17:46:12.357] launch task StepId=23.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:57970

[2024-02-07T17:46:12.357] task/affinity: lllp_distribution: JobId=23 implicit auto binding: cores,one_thread, dist 8192

[2024-02-07T17:46:12.357] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic

[2024-02-07T17:46:12.357] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread, 0x07

[2024-02-07T17:47:12.450] [23.0] done with job

    3. N3

[2024-02-07T17:46:11.580] launch task StepId=23.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:50374

[2024-02-07T17:46:11.580] task/affinity: lllp_distribution: JobId=23 implicit auto binding: cores,one_thread, dist 8192

[2024-02-07T17:46:11.580] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic

[2024-02-07T17:46:11.580] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread, 0x07

[2024-02-07T17:47:11.671] [23.0] done with job
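To check many node logs at once, the binding mask can be pulled out of the `_lllp_generate_cpu_bind` lines with a regular expression. The sketch below is one possible approach, not a Slurm tool; `SAMPLE_LOG` is an excerpt of the N1 log above, and `extract_masks` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Excerpt of the N1 log above; the other nodes follow the same pattern.
SAMPLE_LOG = """\
[2024-02-07T17:46:13.730] launch task StepId=23.0 request from UID:0 GID:0 HOST:192.168.100.40 PORT:56210
[2024-02-07T17:46:13.731] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [23]: mask_cpu,one_thread,
0x07
[2024-02-07T17:47:13.825] [23.0] done with job
"""

# \s* also matches a newline, so the mask is found whether it sits on the
# same line as the message or on the line after it.
MASK_RE = re.compile(
    r"_lllp_generate_cpu_bind jobid \[(\d+)\]: mask_cpu,one_thread,\s*(0x[0-9A-Fa-f]+)"
)


def extract_masks(log_text: str) -> List[Tuple[int, int]]:
    """Return (job_id, mask) pairs found in the given log text."""
    return [(int(job), int(mask, 16)) for job, mask in MASK_RE.findall(log_text)]


masks = extract_masks(SAMPLE_LOG)
print(masks)  # prints: [(23, 7)]
```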

 

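IV. Summary

The logs first confirm that the tasks ran on all 3 nodes. Second, the CPU mask in each log (0x07, binary 111, three bits set) shows that 3 CPUs on each node were allocated to the job.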
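The mask reading can be verified mechanically: each set bit in the mask stands for one bound CPU. A minimal Python sketch follows; `cpus_in_mask` is an illustrative name, and the bit positions are assumed to be the CPU indices as numbered on that node.

```python
from typing import List


def cpus_in_mask(mask: int) -> List[int]:
    """Return the indices of the bits set in a CPU affinity mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


cpus = cpus_in_mask(0x07)
print(cpus, len(cpus))  # prints: [0, 1, 2] 3
```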