Discussion:
Protocol mismatch
Vsevolod Nikonorov
2014-09-01 09:24:30 UTC
Hello,

I tried to test slurm-14.11 on some of my nodes while the other nodes ran
slurm-2.5.7, and the nodes running 14.11 were not excluded from the 2.5.7
controller config. Something seems to have confused the 2.5.7 controller:
tasks were doubled for some time (each task was visible twice in the
smap list), and after I excluded the 14.11 nodes from the 2.5.7 controller
config those tasks restarted and the doubling ended.

Can the protocol mismatch (which was definitely visible in the log) be
related to the task doubling and hanging? Are there any other safety
measures besides cross-excluding foreign-version nodes from the
controllers? I don't want to make our polite users sad again :)

Thanks in advance!
--
Никоноров Всеволод Дмитриевич, ОИТТиС, НИКИЭТ

Vsevolod D. Nikonorov, JSC NIKET