Discussion:
Cluster(s) seem OK, but: Zero Bytes were transmitted or received (14.03.6)
Gerben Roest
2014-08-17 20:25:39 UTC
I run a slurmctld and slurmdbd on a Scientific Linux (SL) 5 server and
have three SL6 nodes, all running Slurm 14.03.6, with one node behind
another slurmctld on another cluster. The whole Slurm setup seems to run
fine in tests, even when submitting from one cluster to the other.
The slurmctld daemon on the machine where slurmdbd is also running shows

error: slurm_receive_msg: Zero Bytes were transmitted or received

and is spamming /var/log/slurm/slurmctld.log unless I run

scontrol setdebug fatal

and then it's quiet. What could be causing these messages when everything
(at least everything I test) seems fine? Is it expecting data from
long-gone nodes or something?
The time is synchronized across all nodes and servers, and munged is
running with the same key everywhere.
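For reference, this is roughly how I checked the munge keys, and how I
can raise the log level again to capture more context (node01 is just an
example hostname):

# a credential generated on this host should decode on a remote node
munge -n | ssh node01 unmunge

# temporarily raise slurmctld verbosity, then drop it back afterwards
scontrol setdebug debug2
scontrol setdebug info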

Configuration data as of 2014-08-17T18:38:36
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = server620
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = YES
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2014-08-17T18:20:51
CacheGroups = 0
CheckpointType = checkpoint/none
ClusterName = testcluster
CompleteWait = 0 sec
ControlAddr = server620
ControlMachine = server620
CoreSpecPlugin = core_spec/none
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = NO
DynAllocPort = 0
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FastSchedule = 0
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm/slurm_jobcomploc
JobCompPort = 0
JobCompType = jobcomp/filetxt
JobCompUser = root
JobContainerPlugin = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 30 sec
LaunchType = launch/slurm
Licenses = (null)
LicensesUsed = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 4294901760
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 2
OverTimeLimit = 0 min
PluginDir = /usr/local/slurm-sl5/lib/slurm
PlugStackConfig = /usr/local/slurm-etc/plugstack.conf
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/linear
SlurmUser = slurm(1283)
SlurmctldDebug = fatal
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmSchedLogFile = (null)
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdPidFile = /var/run/slurm/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurm/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurm/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /usr/local/slurm-etc/slurm.conf
SLURM_VERSION = 14.03.6
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /var/spool/slurm
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/none
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec


thanks,

Gerben
Paddy Doyle
2014-08-18 07:28:32 UTC
Hi Gerben,
Post by Gerben Roest
I run a slurmctld and slurmdbd on a Scientific Linux (SL) 5 server and
have three SL6 nodes, all running Slurm 14.03.6, with one node behind
another slurmctld on another cluster. The whole Slurm setup seems to run
fine in tests, even when submitting from one cluster to the other.
The slurmctld daemon on the machine where slurmdbd is also running shows
error: slurm_receive_msg: Zero Bytes were transmitted or received
For me, that's usually a version mismatch somewhere. One of the daemons
is a version behind, so there's a protocol mismatch when they try to
communicate. I'd first double-check that all versions are the same (and
that everything has been restarted since any upgrades).
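Something along these lines on each machine should confirm it (assuming
the daemons are in your PATH):

# each daemon reports its own version
slurmctld -V
slurmd -V
slurmdbd -V

# and the client tools report theirs
scontrol version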

Paddy
Gerben Roest
2014-08-18 10:12:29 UTC
Hi Paddy,
Post by Paddy Doyle
Post by Gerben Roest
I run a slurmctld and slurmdbd on a Scientific Linux (SL) 5 server and
have three SL6 nodes, all running Slurm 14.03.6, with one node behind
another slurmctld on another cluster. The whole Slurm setup seems to run
fine in tests, even when submitting from one cluster to the other.
The slurmctld daemon on the machine where slurmdbd is also running shows
error: slurm_receive_msg: Zero Bytes were transmitted or received
For me, that's usually a version mismatch somewhere. One of the daemons
is a version behind, so there's a protocol mismatch when they try to
communicate. I'd first double-check that all versions are the same (and
that everything has been restarted since any upgrades).
I have checked the versions of the main slurmctld and the slurmds on the
nodes, as well as the slurmctld on the other cluster and the slurmds on
its nodes, and they all run 14.03.6. I didn't upgrade; I started straight
with 14.03.6.
The only difference might be that the main master runs 14.03.6 compiled
for SL5 with "-O0", while the others run it from another directory (on
NFS), compiled from the same source but without "-O0" and "make install"ed
into that other directory, built for the SL6 machines (because of GLIBC
dependencies). But I'd guess you should be able to mix different builds
of Slurm as long as they are the same version? It all seems to work, but
I only get these strange log messages.
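For completeness, the two builds were done roughly like this (the SL5
prefix matches PluginDir in the config above; the SL6/NFS prefix below
is just an illustration):

# SL5 build for the master, unoptimized for easier debugging
./configure --prefix=/usr/local/slurm-sl5 CFLAGS="-O0"
make && make install

# SL6 build from the same source, default flags, installed to NFS
./configure --prefix=/nfs/slurm-sl6
make && make install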
If you need more verbose information, please let me know what to check.

Gerben
Nathan Harper
2014-08-18 10:28:37 UTC
Hi,

Just a "me too": we're also running 14.03.6 on the compute nodes, with
master nodes on RHEL5 built with -O0, and we get the same thing in the
logs, so it's not just you.
--
Nathan Harper // IT Systems Architect
t: 0117 906 1104 // m: 07875 510891 // w: www.cfms.org.uk
CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent //
Emersons Green // Bristol // BS16 7FR