5. Repairing Slurm

If Slurm is acting strangely or not launching jobs properly, it is best to repair Slurm for the entire cluster. To restart the Slurm services on the entire cluster, issue the following commands:

[root@HeadNode ~]# service slurmctld stop
[root@HeadNode ~]# gsh 'service slurmd stop'
[root@HeadNode ~]# service slurmctld start
[root@HeadNode ~]#…

3. Interactive Job

Interactive jobs let you run something through the scheduler at a prompt, which also gives you a way to troubleshoot through Slurm. It is often easier to run something interactively than to track down the node that initially launched the job.  salloc – can be used to allocate resources …

4. Stopping A Job

scancel <job id#> – This command will stop a job.  *Do not issue a kill or a pkill. It can take up to a few minutes for the job to finish cancelling.
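The cancel-then-wait advice above can be sketched as a small shell function. This is only an illustration, assuming scancel and squeue are on the PATH. The function name cancel_and_wait, the 300-second default timeout, and the 5-second poll interval are hypothetical choices, not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): cancel a job, then wait for it
# to leave the queue instead of reaching for kill or pkill.
cancel_and_wait() {
    local job_id="$1"
    local timeout="${2:-300}"   # seconds to wait; an assumed default
    local waited=0

    # Sketch limitation: only plain numeric job IDs are accepted here
    # (array-task IDs such as 123_4 are not handled).
    case "$job_id" in
        ''|*[!0-9]*)
            echo "usage: cancel_and_wait <job id> [timeout seconds]" >&2
            return 2
            ;;
    esac

    # Ask Slurm to cancel the job.
    scancel "$job_id" || return 1

    # squeue -h drops the header and -j filters by job ID, so empty
    # output means the job is no longer in the queue.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "job $job_id still in queue after ${timeout}s" >&2
            return 1
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "job $job_id is no longer in the queue"
}
```

For example, cancel_and_wait 12345 would cancel the (made-up) job 12345 and return once squeue stops listing it.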

2. Monitoring a Job

Once your jobs are in the queue, you will need to monitor them to make sure they are working properly.  The following commands can be used to view the current status of the nodes and of the jobs in the queue.  sinfo – This command will display information about what…

1. Running A Job

The general concept of a scheduler is simple.  You create a Slurm batch script and then queue it to be run on the cluster.  Inside the script you request resources that have been allocated to the scheduler (configured in the /etc/slurm/slurm.conf file).  Below is an example of a Slurm…