Всеволод Никоноров
2014-09-01 10:26:30 UTC
Hello,
I tried to test slurm-14.11 on some of my nodes while the other nodes ran slurm-2.5.7, and the nodes running 14.11 had not been excluded from the 2.5.7 controller's config. Something seems to have confused the 2.5.7 controller: for some time tasks were doubled (each task was visible twice in the smap list). After I excluded the 14.11 nodes from the 2.5.7 controller's config, those tasks restarted and the doubling stopped.
Can the protocol mismatch (which was definitely visible in the log) be related to the task doubling and hanging? Are there any safety measures other than excluding foreign-version nodes from each controller's config? I don't want to make our polite users sad again :)
Thanks in advance!