Markus Blank-Burian
2014-07-15 13:42:36 UTC
Hi,
after job 436172 completed, the slurmctld daemon segfaulted. Starting
slurmctld again reproduces the segfault. Debugging with gdb shows the
following backtrace. How can I fix this without losing the entire saved state?
Markus
slurmctld: _sync_nodes_to_comp_job: Job 436172 in completing state
[New Thread 0x7ffff2906700 (LWP 15397)]
[New Thread 0x7ffff2805700 (LWP 15398)]
slurmctld: debug: Priority MULTIFACTOR plugin loaded
slurmctld: debug2: _adjust_limit_usage: job 436172: MPC: job_memory set to 16384
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[New Thread 0x7ffff2704700 (LWP 15399)]
slurmctld: _sync_nodes_to_comp_job: completing 1 jobs
slurmctld: debug: Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
[New Thread 0x7ffff2603700 (LWP 15400)]
slurmctld: debug2: got 1 threads to send out
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 3 partitions
[New Thread 0x7ffff2502700 (LWP 15401)]
[New Thread 0x7ffff2401700 (LWP 15402)]
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: debug2: Tree head got back 1
[Thread 0x7ffff2401700 (LWP 15402) exited]
slurmctld: cleanup_completing: job 436172 completion process took 2671 seconds
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2502700 (LWP 15401)]
0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>) at gres.c:2945
2945			FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
(gdb) bt
#0  0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>) at gres.c:2945
#1  0x000000000048e350 in delete_step_records (job_ptr=***@entry=0xb85b08) at step_mgr.c:263
#2  0x000000000045d7d3 in cleanup_completing (job_ptr=***@entry=0xb85b08) at job_scheduler.c:3057
#3  0x000000000046713c in make_node_idle (node_ptr=0x7ff728, job_ptr=***@entry=0xb85b08) at node_mgr.c:3072
#4  0x000000000044bac6 in job_epilog_complete (job_id=436172, node_name=0x7fffe0000cc8 "kaa-23", return_code=***@entry=0) at job_mgr.c:10265
#5  0x0000000000436d7c in _thread_per_group_rpc (args=0x7fffe8000a28) at agent.c:923
#6  0x00007ffff7486ed3 in start_thread (arg=0x7ffff2502700) at pthread_create.c:308
#7  0x00007ffff71bbe2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) list
2940			if (!job_gres_ptr)
2941				continue;
2942			job_state_ptr = (gres_job_state_t *) job_gres_ptr->gres_data;
2943			for (i = 0; i < job_state_ptr->node_cnt; i++) {
2944				FREE_NULL_BITMAP(job_state_ptr->gres_bit_alloc[i]);
2945				FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
2946			}
2947			xfree(job_state_ptr->gres_bit_alloc);
2948			xfree(job_state_ptr->gres_bit_step_alloc);
2949			xfree(job_state_ptr->gres_cnt_step_alloc);
(gdb)