User Guide
A simple guide for your NZO Cloud.
Access the user guide for helpful information on accessing, running and troubleshooting your NZO Cloud.
Access
1. Initial Set Up
NZO Cloud is a unique product in which a custom solution is created specifically for you, working directly with an NZO engineer. Since each solution is highly specialized to fit your needs, a custom hardware design is often necessary before initial set up. You can choose from the following:
– A direct point to point VPN Tunnel connected to a shared or dedicated firewall.
– A direct SSL VPN solution that connects your users from their desktops and/or devices directly to resources.
– Bring your own connectivity – a solution that allows you to connect using modern cloud technologies that are already deployed inside your infrastructure.
2. Direct Point to Point VPN Tunnel
NZO is all about putting the systems administrator, engineer, and IT staff in control of compute, storage, and network systems using state of the art open source technology. As such, you can choose to take advantage of traditional VPN tunnels to extend your network. When using a point to point tunnel you are essentially extending your infrastructure into NZO’s cloud resources, giving you access to state of the art resources such as massive compute on bare metal hardware or huge storage arrays. NZO’s direct point to point solution is as simple as working with your networking team and setting up an encrypted tunnel.
3. Direct SSL VPN Connection
Connecting researchers, educators, IT staff, and other personnel can be a daunting task for any systems administrator. NZO engineers can work with you to create direct SSL VPN tunnels from your users directly to NZO resources.
Accounts
1. Creating Accounts
The syncusers script replicates user accounts across the NZO HPC Cluster making it necessary to add accounts only to the head node.
1. After logging into the head node as root (or via sudo -i), issue the commands listed below, changing “John Doe” and jdoe to the user you are adding to the system.
# useradd -c "John Doe" jdoe
# passwd jdoe
# su - jdoe
$ exit
# syncusers
*Note: The “su -” step is necessary as ssh keys will be generated for new user accounts upon the first login and also to test the account. Once the keys are in place and the user accounts are synchronized, the user will be able to use “passwordless” ssh, which is necessary for running jobs on the cluster.
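To verify passwordless ssh for the new account, you can switch to the user and run a command on a compute node; the node name below is only an example and should be replaced with one of your compute nodes.
# su - jdoe
$ ssh n001 hostname
$ exit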
2. Managing Accounts
From time to time you will want to check users’ home directories and logins. The following commands will make your life easier as an NZO HPC administrator:
– Checking user’s space in /home
# du -h --max-depth=1 /home/jdoe
– Monitoring good login attempts
# last
– Monitoring failed login attempts
# lastb
– Monitoring who is currently logged in
# w
3. Deleting Accounts
Below are the steps required to delete a user account on the NZO HPC Cloud. Please make sure to log in to the head node as root or use sudo -i.
# userdel -r jdoe (use without the '-r' to keep the user's home directory)
# syncusers
Groups
1. Adding Users
The syncusers script replicates user accounts across the NZO HPC Cluster, making it necessary to add accounts only to the head node.
Below are the steps required to add a new user. Be sure to log in to the head node as root and type the following (e.g., the new user account is John Doe):
useradd -c "John Doe" jdoe
passwd jdoe
su - jdoe
exit
syncusers
Note: The “su -” step is necessary as ssh keys will be generated for new user accounts upon the first login. Once the keys are in place and the user accounts are synchronized, the user will be able to use “passwordless” ssh, which is necessary for running jobs on the cluster.
2. Deleting Users
Below are the steps required to delete a user. Be sure to log in to the head node as root and type the following:
userdel -r jdoe (use without the '-r' to keep the user's home directory)
syncusers
Job Control
1. Running A Job
The general procedure and concept of a scheduler is simple. You will essentially create a slurm batch script and then queue it up to be run on the cluster. Inside the script you will request resources that have been allocated to the scheduler (configured in the /etc/slurm/slurm.conf file).
Below is an example of a slurm batch file named example.batch.sh
#!/bin/bash
#
#SBATCH --job-name="test-job"
#SBATCH --output="outfile.out.%j"
#SBATCH --ntasks-per-node=32
#SBATCH --nodes=2
#SBATCH --time=00:00:00
#SBATCH --partition=normal
module purge
module load openmpi/4.1.5_gnu
mpirun ./myprogram
From here you would issue the command sbatch example.batch.sh as a user.
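As a quick sketch of what submission looks like, sbatch prints the ID assigned to your job, which you can then pass to squeue; the job ID shown below is only an example.
[user@clusterhn ~]$ sbatch example.batch.sh
Submitted batch job 1234
[user@clusterhn ~]$ squeue -j 1234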
2. Monitoring a Job
Once you have jobs in the queue you will need to monitor them to make sure they are working properly. The following commands can be used to view the current status of the nodes and of the jobs in the queue.
sinfo – This command will display information about which nodes are available and the state they are in.
squeue – This command will give you information about which jobs are in the queue and which jobs are currently running.
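For reference, the default output of these commands looks roughly like the sample below; the partition, node names, and job details are illustrative and will differ on your cluster.
[user@clusterhn ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      2   idle n[001-002]
[user@clusterhn ~]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   1234    normal test-job     jdoe  R       0:42      2 n[001-002]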
3. Interactive Job
Interactive jobs allow you to run work through the scheduler with a live prompt, which also gives you the ability to troubleshoot through slurm. It is often easier to run something interactively than to track down the node that initially fired off the job.
salloc – can be used to allocate resources.
srun – can be used to actually run things in interactive mode (see the sketch below).
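A minimal interactive session might look like the following; the node count, task count, and time limit are only examples.
[user@clusterhn ~]$ salloc --nodes=1 --ntasks=4 --time=01:00:00    # request resources; a shell opens once they are granted
[user@clusterhn ~]$ srun hostname                                  # commands run via srun execute on the allocated node(s)
[user@clusterhn ~]$ exit                                           # leave the salloc shell and release the allocation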
4. Stopping A Job
scancel <job id#> – This command will stop a job. *Do not issue a kill or a pkill; it can take up to a few minutes for the job to cancel out.
5. Repairing Slurm
If slurm is acting strange or not firing jobs off properly, it is best to repair slurm for the entire cluster.
In order to restart the slurm services on the entire cluster, issue the following commands:
[root@HeadNode ~]# service slurmctld stop
[root@HeadNode ~]# gsh 'service slurmd stop'
[root@HeadNode ~]# service slurmctld start
[root@HeadNode ~]# gsh 'service slurmd start'
A few other things to check
- Make sure the time is correct on all nodes
[root@HeadNode ~]# gsh 'date'
- Make sure the slurm configuration on the head node matches the nodes' available resources. You can verify this with the slurmd -C command.
- Make sure the nodes do not have rogue processes running on them. If you allocate resources and they are not available, you will want to know why.
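A few standard slurm commands can also help when diagnosing node problems; the node name below is only an example.
[root@HeadNode ~]# sinfo -R                                        # list nodes that are down or drained, with the reason
[root@HeadNode ~]# scontrol show node n001                         # show the detailed state of a single node
[root@HeadNode ~]# scontrol update nodename=n001 state=resume      # return a drained node to service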
Monitoring
1. Grafana Overview
Grafana is an open-source data visualization tool that allows users to create interactive dashboards and graphs to display real-time data. It provides a range of features such as data parsing, filtering, and drill-down capabilities, making it easy to explore and analyze complex datasets. Users can customize their dashboards with different types of graphs, charts, and widgets to suit their needs. This installation of Grafana has been customized to provide an out-of-the-box overview for your NZO HPC resource. Grafana can provide detailed statistics about slurm and your jobs.
2. Accessing Grafana
In order to access Grafana you will need to use the x2go client and log in as either the main user or as root. Once you bring up a desktop environment, open a web browser to http://localhost:5300. You will be prompted to log in; please contact us in order to get the username and password.
After you log in, Grafana will present a splash screen. Click the hamburger menu button; from there you will see Dashboards, and you can click on the dashboard you wish to see. We have included five dashboards, one of which integrates with slurm.
In this example you can see the head node being analyzed while a job is running. If you go to the Node Comparison Dashboard you can do a deep dive on a job that is running across multiple nodes. From this example you can clearly see what resources are available and what resources are being used.
You will notice that some of the dashboards have a Report button. If you click it, a new tab will open with a PDF report.
Applications
1. Application Overview
Applications on NZO HPC are designed to handle complex computational tasks that require high levels of processing power and memory. These systems are typically used in scientific and engineering fields where large datasets need to be analyzed, simulated or optimized. NZO HPC is currently used across various industries for a range of applications such as weather forecasting, fluid dynamics, aircraft design simulation, and optimization of complex data sets.
2. Environment Modules
Environment Modules is a package that allows environment setup scripts (modulefiles) to be installed on your NZO HPC cluster. We configure these modulefiles to be placed in the /opt/modulefiles directory, where they can expand and simplify the functionality of your NZO HPC deployment. All of the common shells are supported, and modules are loaded using the module load command. To see a list of modules that are available on your system, you can issue a module avail command.
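For example, a typical interactive session for finding and loading a module might look like the following; the module name is only an example, so use the names reported by module avail on your system.
[user@clusterhn ~]$ module avail                      # list every module installed on the cluster
[user@clusterhn ~]$ module load openmpi/4.1.5_gnu     # add the module's paths to your environment
[user@clusterhn ~]$ module list                       # show the modules currently loaded
[user@clusterhn ~]$ module purge                      # unload everything and start clean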
Typically a job is queued up to the scheduler with a batch file. Inside that batch file you will often see a module purge command followed by a module load <environment modulefile> that will load the environment for your specific script to run the job successfully.
A typical batch file for slurm will look like the following:
#!/bin/sh
#SBATCH --partition=normal
#SBATCH --time=00:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --job-name="slurm-xhpl"
#SBATCH --output=slurm-xhpl.out.%j
module purge
module load openmpi/4.1.5_gnu
cd /home/<someuser>/PSSC-tools/xhpl-gnu
mpiexec ./xhpl-openmpi4-gnu-openblas
You will notice the module purge and module load portion of the script. This makes it easy to implement consistency when running jobs across many nodes.
An example module file to load OpenMPI might look like the following:
#%Module1.0#####################################################################
##
## module-cvs modulefile
##
## modulefiles/module-git. Generated from module-git.in by configure.
##
proc ModulesHelp { } {
global version
puts stderr "\tThis module will set up an alias"
puts stderr "\tfor easy anonymous check-out of this version of the"
puts stderr "\tenvironment modules package."
puts stderr "\tget-modules - retrieve modules sources for this version"
puts stderr "\n\tVersion $version\n"
}
# for Tcl script use only
set compiler gnu
set version 4.1.5
set topdir /opt/openmpi
set sys linux86
module-whatis "environment variables for ompi 4.1.5"
prepend-path PATH $topdir/${version}_${compiler}/include
prepend-path PATH $topdir/${version}_${compiler}/bin
prepend-path MANPATH $topdir/${version}_${compiler}/man
prepend-path LD_LIBRARY_PATH $topdir/${version}_${compiler}/lib
#if [ module-info mode load ] {
# ModulesHelp
#}
3. Application Installation
Applications can be installed either via the package manager or by compiling from source. The /opt directory on the head node is mounted on all the computational nodes. When compiling applications it is best to compile the application on a compute node using a flag of --prefix=/opt/<name of application>. While it is possible to compile things on the head node, it is generally not best practice to do so.
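As a rough sketch, building a hypothetical autotools-based package named myapp on a compute node might look like the following; the node name, package name, and version are illustrative, and installs into /opt may require root privileges.
[user@clusterhn ~]$ ssh n001                          # build on a compute node, not the head node
[user@n001 ~]$ cd myapp-1.0
[user@n001 myapp-1.0]$ ./configure --prefix=/opt/myapp
[user@n001 myapp-1.0]$ make -j $(nproc)
[user@n001 myapp-1.0]$ make install                   # /opt is shared, so every node sees the result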
4. Compilers
Each NZO HPC system comes with several compilers that make it easier for users to develop, maintain, and optimize their code. We provide basic support (we can install and assist with getting started) for AMD AOCC, GCC, Intel oneAPI, and the NVIDIA toolkit.
5. LMod
Lmod is a more modern environment module system that can be installed on your cluster by contacting our team of engineers. It works the same as Environment Modules; however, it provides a much more convenient way to dynamically change the user's paths.
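With Lmod the familiar module commands continue to work, and Lmod also provides a shorter ml alias; a quick sketch:
[user@clusterhn ~]$ ml                    # same as module list
[user@clusterhn ~]$ ml openmpi            # same as module load openmpi
[user@clusterhn ~]$ ml -openmpi           # same as module unload openmpi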
6. Open MPI
Environment modules and the module command will be very useful for loading different versions of OpenMPI required by specific HPC code. To see the OpenMPI modules, run the following command (sample output shown).
[root@clusterhn ~]# module avail openmpi
------------------------------ /opt/modulefiles ------------------------------
openmpi/2.1.6_gnu openmpi/3.1.6_intel
openmpi/2.1.6_intel openmpi/4.1.1_gnu(default)
openmpi/3.1.6_gnu openmpi/4.1.1_intel
By default the latest version of OpenMPI compiled with the default GNU compiler will be loaded (e.g. openmpi/4.1.1_gnu).
To load a different openmpi module interactively or within a slurm job, first unload the default version, then load the new version.
[user@clusterhn ~]$ module unload openmpi/4.1.1_gnu
[user@clusterhn ~]$ module load openmpi/3.1.6_gnu
For running jobs, the slurm batch file should also include the module load command so the required version of MPI is loaded on all compute nodes executing the job. The following example is the batch file used to run the cluster’s Linpack test program.
#!/bin/sh
#SBATCH --partition=batch
#SBATCH --time=00:00:00
#SBATCH --nodes=22
#SBATCH --ntasks-per-node=32
#SBATCH --job-name="slurm-xhpl-316"
#SBATCH --output=slurm-xhpl-316-%j.out
module unload openmpi/4.1.1_gnu
module load openmpi/3.1.6_gnu
cd /home/user/PSSC-tools
mpiexec xhpl-openmpi3
You can see which MPI is specified in a node’s PATH using the command, echo $PATH.
[user@clusterhn ~]$ echo $PATH
/opt/openmpi/4.1.1_gnu/bin:/opt/openmpi/4.1.1_gnu/include:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pssc/bin:/home/user/.local/bin:/home/user/bin
Setting a Default OpenMPI
The cluster is configured to load a default OpenMPI version at power on. As of the publication of this manual revision (10/28/21), the default version is 4.1.1. To check the version, run the following command.
[user@clusterhn ~]$ mpiexec --version
mpiexec (OpenRTE) 4.1.1
Loading of the default module is a two-step process: the default modulefile is specified in the .version file in the /opt/modulefiles/openmpi directory, and the modulefile is loaded at power on by including the module load openmpi command in the /etc/profile.d/pssc.sh file.
To change the OpenMPI default, modify the /opt/modulefiles/openmpi/.version file to specify the target version. To disable loading OpenMPI at power on, comment out (precede with #) the module load command in /etc/profile.d/pssc.sh . After making any changes to the pssc.sh file, copy this file to the same location on all of the compute nodes. This can be accomplished by running the following command.
gcp /etc/profile.d/pssc.sh /etc/profile.d/.
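For reference, a minimal /opt/modulefiles/openmpi/.version file typically looks like the sketch below; the version string is only an example and should match one of the modulefiles in that directory.
#%Module1.0
set ModulesVersion "4.1.1_gnu"
With this in place, module load openmpi will load openmpi/4.1.1_gnu by default.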
MPI Sample Commands
Here are some quick examples of ways to use the MPI(s) installed on your system:
This command will use OpenMPI to run 10 hostname processes across the cluster:
[user@clusterhn ~]$ mpiexec -np 10 -hostfile /opt/machinelist hostname | sort
Instead of running the above command on any random machines, you can specify the machines on which the processes will run. This command will spawn 5 instances of hostname on n001 and another 5 instances on n002:
[user@clusterhn ~]$ mpiexec -np 10 -H n001,n002 -npernode 5 hostname | sort
These are just a couple of rudimentary examples of how you can use OpenMPI. To learn more about using OpenMPI, visit the documentation website: http://www.open-mpi.org/doc/
7. Apptainer
Apptainer is an open source container platform that was previously known as Singularity Community Edition. NZO comes with built-in support for Apptainer, which is the preferred way to run HPC workloads. We do not restrict access to public repositories, giving you the ability to try various types of software without worrying about messing up your cluster. For those familiar with Docker, the table listed below gives a good comparison of commands between Docker and Apptainer:
Apptainer                    Docker           Description
apptainer pull docker://     docker pull      Pull a Docker image
apptainer run                docker run       Execute a container
apptainer build              docker build     Build a new container
apptainer exec               docker exec      Run a command within a container
apptainer shell              docker run -it   Enter a shell within a container
apptainer instance start     docker run -d    Run a container as a service
apptainer instance stop      docker stop      Stop a container
Definition file (.def)       Dockerfile       Blueprint to build a container
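As a brief sketch of the workflow, pulling a public image and running a command inside it might look like the following; the image name is only an example.
[user@clusterhn ~]$ apptainer pull docker://ubuntu:22.04          # creates ubuntu_22.04.sif in the current directory
[user@clusterhn ~]$ apptainer exec ubuntu_22.04.sif cat /etc/os-release
[user@clusterhn ~]$ apptainer shell ubuntu_22.04.sif              # interactive shell inside the container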
For more information please see:
https://apptainer.org/docs/user/main/introduction.html
https://apptainer.org/docs/user/main/mpi.html
https://apptainer.org/docs/user/main/mpi.html#batch-scheduler-slurm