Discussion:
Segfault in gres.c:2945 with 14.03.5
Markus Blank-Burian
2014-07-15 13:42:36 UTC
Hi,

After job 436172 completed, the slurmctld daemon segfaulted. Restarting
slurmctld reproduces the segfault. Debugging with gdb shows the
following backtrace. How can I fix this without losing the entire controller state?

Markus


slurmctld: _sync_nodes_to_comp_job: Job 436172 in completing state
[New Thread 0x7ffff2906700 (LWP 15397)]
[New Thread 0x7ffff2805700 (LWP 15398)]
slurmctld: debug: Priority MULTIFACTOR plugin loaded
slurmctld: debug2: _adjust_limit_usage: job 436172: MPC: job_memory set to
16384
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[New Thread 0x7ffff2704700 (LWP 15399)]
slurmctld: _sync_nodes_to_comp_job: completing 1 jobs
slurmctld: debug: Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
[New Thread 0x7ffff2603700 (LWP 15400)]
slurmctld: debug2: got 1 threads to send out
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 3 partitions
[New Thread 0x7ffff2502700 (LWP 15401)]
[New Thread 0x7ffff2401700 (LWP 15402)]
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: debug2: Tree head got back 1
[Thread 0x7ffff2401700 (LWP 15402) exited]
slurmctld: cleanup_completing: job 436172 completion process took 2671 seconds

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2502700 (LWP 15401)]
0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>) at gres.c:2945
2945				FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
(gdb) bt
#0  0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>) at gres.c:2945
#1  0x000000000048e350 in delete_step_records (job_ptr=***@entry=0xb85b08) at step_mgr.c:263
#2  0x000000000045d7d3 in cleanup_completing (job_ptr=***@entry=0xb85b08) at job_scheduler.c:3057
#3  0x000000000046713c in make_node_idle (node_ptr=0x7ff728, job_ptr=***@entry=0xb85b08) at node_mgr.c:3072
#4  0x000000000044bac6 in job_epilog_complete (job_id=436172, node_name=0x7fffe0000cc8 "kaa-23", return_code=***@entry=0) at job_mgr.c:10265
#5  0x0000000000436d7c in _thread_per_group_rpc (args=0x7fffe8000a28) at agent.c:923
#6  0x00007ffff7486ed3 in start_thread (arg=0x7ffff2502700) at pthread_create.c:308
#7  0x00007ffff71bbe2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) list
2940			if (!job_gres_ptr)
2941				continue;
2942			job_state_ptr = (gres_job_state_t *) job_gres_ptr->gres_data;
2943			for (i = 0; i < job_state_ptr->node_cnt; i++) {
2944				FREE_NULL_BITMAP(job_state_ptr->gres_bit_alloc[i]);
2945				FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
2946			}
2947			xfree(job_state_ptr->gres_bit_alloc);
2948			xfree(job_state_ptr->gres_bit_step_alloc);
2949			xfree(job_state_ptr->gres_cnt_step_alloc);
(gdb)
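
For context on what actually faults here: Slurm's FREE_NULL_BITMAP is roughly "if (x) bit_free(x); x = NULL;", so it guards the bitmap element it is handed, not the array it is indexed out of. The backtrace is consistent with job_state_ptr->gres_bit_step_alloc itself being NULL, in which case gres_bit_step_alloc[i] dereferences a null pointer before the macro's check can run; this also matches the workaround posted later in the thread. A minimal standalone sketch of that failure mode, with hypothetical names standing in for the Slurm structures (not Slurm source):

/* Sketch of the crash at gres.c:2945: the element check inside the
 * macro cannot help when the array itself is NULL. */
#include <stdlib.h>

typedef struct bitstr bitstr_t;   /* stand-in for Slurm's bitmap type */

/* roughly the shape of Slurm's macro: guard the element, then clear it */
#define FREE_NULL_BITMAP(_X) do { if (_X) free(_X); (_X) = NULL; } while (0)

struct job_gres_state {           /* hypothetical, trimmed-down job GRES state */
	int        node_cnt;
	bitstr_t **gres_bit_alloc;
	bitstr_t **gres_bit_step_alloc;   /* may be NULL if no step used GRES */
};

static void clear_job_gres(struct job_gres_state *js)
{
	for (int i = 0; i < js->node_cnt; i++) {
		FREE_NULL_BITMAP(js->gres_bit_alloc[i]);
		/* SIGSEGV here when js->gres_bit_step_alloc == NULL:
		 * indexing the NULL array faults before the if() runs */
		FREE_NULL_BITMAP(js->gres_bit_step_alloc[i]);
	}
}

int main(void)
{
	struct job_gres_state js = {
		.node_cnt = 1,
		.gres_bit_alloc = calloc(1, sizeof(bitstr_t *)),
		.gres_bit_step_alloc = NULL,   /* the condition that crashes */
	};
	clear_job_gres(&js);               /* faults, mirroring the backtrace */
	return 0;
}
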
Marcin Stolarek
2014-07-15 13:46:19 UTC
2014-07-15 15:43 GMT+02:00 Markus Blank-Burian <burian-***@public.gmane.org>:

> Hi,
>
> after job 436172 completed, the slurmctld daemon segfaulted. Starting
> slurmctld again reproduces the segfault. Debugging with gdb shows the
> following backtrace. How can i fix this without losing the complete state?
>
> Markus
>
>
check this bug:
http://bugs.schedmd.com/show_bug.cgi?id=958

Markus Blank-Burian
2014-07-15 13:52:32 UTC
> check this bug:
> http://bugs.schedmd.com/show_bug.cgi?id=958
>

Thanks for the quick reply!
Blomqvist Janne
2014-07-15 19:47:32 UTC
Hi,

FWIW, we're hitting this bug as well with 14.03.5; 14.03.4 was fine, so this looks like a recent regression. Luckily, per the bugzilla entry, the bug has already been fixed.

--
Janne Blomqvist

________________________________________
From: Markus Blank-Burian [burian-***@public.gmane.org]
Sent: Tuesday, July 15, 2014 16:52
To: slurm-dev
Subject: [slurm-dev] Re: Segfault in gres.c:2945 with 14.03.5

> check this bug:
> http://bugs.schedmd.com/show_bug.cgi?id=958
>

Thanks for the quick reply!
Franco Broi
2014-07-16 05:58:30 UTC
Me too.

I patched gres.c, but I think you should do a quick bug-fix release...

--- ../slurm-14.03.5.save/src/common/gres.c	2014-07-11 01:26:55.000000000 +0800
+++ ./src/common/gres.c	2014-07-16 13:46:32.458013818 +0800
@@ -2942,6 +2942,7 @@
 		job_state_ptr = (gres_job_state_t *) job_gres_ptr->gres_data;
 		for (i = 0; i < job_state_ptr->node_cnt; i++) {
 			FREE_NULL_BITMAP(job_state_ptr->gres_bit_alloc[i]);
+			if (job_state_ptr->gres_bit_step_alloc)
 			FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
 		}
 		xfree(job_state_ptr->gres_bit_alloc);
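
Note that the added if has no braces, so it guards only the FREE_NULL_BITMAP call on the next line; since FREE_NULL_BITMAP is (in its usual form) a do { ... } while (0) macro, that is a single statement and the patch behaves as intended even though the indentation suggests otherwise. A sketch of how the guarded loop could be written with the check explicit, assuming it sits inside gres.c with the file's existing types and macros in scope (the actual upstream fix for bug 958 may differ):

	/* sketch only: same effect as the one-line guard in the patch above */
	for (i = 0; i < job_state_ptr->node_cnt; i++) {
		FREE_NULL_BITMAP(job_state_ptr->gres_bit_alloc[i]);
		/* gres_bit_step_alloc may be NULL if no step ever allocated GRES */
		if (job_state_ptr->gres_bit_step_alloc)
			FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
	}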


On Tue, 2014-07-15 at 12:47 -0700, Blomqvist Janne wrote:
> Hi,
>
> FWIW we're hitting this bug as well with 14.03.5. 14.03.4 was fine, so this seems to be a recent regression. Luckily per bugzilla the bug has already been fixed.
>
> --
> Janne Blomqvist