
CRC Getting Started on NOTS

Introduction

NOTS (Night Owls Time-Sharing Service) is a batch-scheduled HPC/HTC cluster running on the Rice Big Research Data (BiRD) cloud infrastructure. The system consists of 298 dual-socket compute blades housed within HPE s6500, HPE Apollo 2000, and Dell PowerEdge C6400 chassis. All nodes are interconnected with a 10 or 25 Gigabit Ethernet network. In addition, the Apollo and C6400 chassis are connected with high-speed Omni-Path for message-passing applications. A 160 TB Lustre filesystem is attached to the compute nodes via Ethernet. The system supports a variety of workloads, including single-node, parallel, large-memory multithreaded, and GPU jobs.

NOTE:  An expansion to NOTS is available for testing.  For details, please see CRC NOTS Expansion (NOTSx)

 

NSF Citation

If you use NOTS to support your research activities, you are required to acknowledge (in publications, on your project web pages, …) the National Science Foundation grant that was used in part to fund the procurement of this system. An example acknowledgement that can be used follows. Feel free to modify wording for your specific needs but please keep the essential information:

This work was supported in part by the Big-Data Private-Cloud Research Cyberinfrastructure MRI-award funded by NSF under grant CNS-1338099 and by Rice University's Center for Research Computing (CRC).

Compute

Hardware | Nodes | CPU | Cores | Hyperthreaded | RAM | Disk | High Speed Network | Storage Network
HPE SL230s | 136 | Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz | 16 | Yes | varies: 32 GB to 128 GB | 4 TB/node | None | 10 GbE
HPE XL170r | 28 | Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz | 24 | Yes | varies: 32 GB to 128 GB | 200 GB/node | Omni-Path | 10 GbE
Dell PowerEdge C6420 | 60 | Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz | 24 | Yes | 192 GB | 120 GB/node | Omni-Path | 10 GbE
HPE XL170r | 52 | Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz | 40 | Yes | 192 GB | 960 GB/node | Omni-Path | 25 GbE
HPE XL170r | 8 | Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz | 40 | Yes | 768 GB | 960 GB/node | Omni-Path | 25 GbE
HPE XL170r | 4 | Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz | 40 | Yes | 1.5 TB | 960 GB/node | Omni-Path | 25 GbE
Dell PowerEdge C6520 | 3 | Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz | 48 | Yes | 256 GB | 960 GB/node | Omni-Path | 25 GbE

GPU

Hardware | Nodes | CPU | Cores | Hyperthreaded | RAM | Disk | GPU | High Speed Network | Storage Network
HPE SL270s | 2 | Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz | 16 | Yes | 128 GB | 4 TB/node | 4 x Tesla K80 | None | 10 GbE
HPE XL675d | 3 | AMD EPYC 7343 CPU @ 3.2GHz | 32 | Yes | 512 GB | 960 GB/node | 8 x Tesla A40 | Omni-Path | 25 GbE
HPE XL190 | 16 | Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz | 40 | Yes | 192 GB | 960 GB/node | 2 x Tesla V100 | Omni-Path | 25 GbE

Prerequisites for Using the Shared Research Computing Resources

All of the clusters that make up our shared computing resources run the Linux operating system (not Windows or MacOS operating systems).  In order to effectively utilize the clusters you must have knowledge of Linux, how to navigate the filesystem, how to create, edit, rename, and delete files, and how to run basic commands and write small scripts.  If you need assistance in this area please review the tutorials that are available on our web site.

Logging Into the Cluster

The cluster login nodes can be accessed through Secure Shell (SSH) from any machine on the Rice campus network.  You will need an active NetID and password in order to log in (unless otherwise instructed).

You must apply for an account.

If you do not have an account on the Shared Computing Resources then you should apply for one. You will need a faculty sponsor for your account who is willing to pay the access fee.

If you need off-campus access, please visit our Off-Campus Access Guide.

To log in to the system from a Linux or Unix machine, use the ssh command:

$ ssh -Y (your_login_name)@hostname.rice.edu

Substitute the actual host name of the cluster you want to access in place of hostname.rice.edu above, for example nots.rice.edu.

To transfer files into the cluster from a Linux or Unix machine, use the scp command:

$ scp some_file.dat *.incl *.txt (your_login_name)@hostname.rice.edu:

As with ssh, substitute the actual host name of the cluster (for example, nots.rice.edu) in place of hostname.rice.edu above.

For more information about using Secure Shell, please see our Using SSH to Login and Copy Files on the campus network Guide.

Login Nodes

Once you are logged in to the system, you are logged into one of several login nodes as shown in the diagram below. These nodes are intended for users to compile software, prepare data files, and submit jobs to the job queue. They are not intended for running compute jobs. Please run all compute jobs in one of the job queues described later in this document.

Login diagram

Diagram courtesy of Chris Hunter, Rice University

Do not run compute jobs on login nodes

Cluster login nodes are multi-user access points intended for users to compile software, copy and prepare data files, and submit jobs to the job queue. Any user running intensive computational tasks directly on the login node risks disciplinary action up to and including the loss of their access privileges.

Data and Quotas

A summary of all filesystems available to all users is presented in the following table:

Filesystem | Accessed via environment variable | Physical Path | Size | Quota | Type | Purge Policy
Home directories | $HOME | /home | 5 TB | 10 GB | NFS | none
Group Project directories | $PROJECTS | /projects | 20 TB | 100 GB per group | NFS | none
Work storage space | $WORK | /storage/hpc/work | 456 TB | 2 TB per group | NFS | none
Shared Scratch high performance I/O | $SHARED_SCRATCH | /scratch | 157 TB | None | Lustre | 14 days
Local Scratch on each node | $TMPDIR | /tmp | 4 TB | None | Local | at the end of each job

 

$SHARED_SCRATCH is not permanent storage

$SHARED_SCRATCH is to be used only for job I/O.  Delete everything you do not need for another run at the end of the job or move to $WORK for analysis. Staff may periodically delete files from the $SHARED_SCRATCH file system even if files are less than 14 days old. A full file system inhibits use of the system for everyone. Using programs or scripts to actively circumvent the file purge policy will not be tolerated.

$WORK on login nodes

$WORK is only available on the login nodes. Data can be copied between $WORK and $SHARED_SCRATCH before/after running jobs.

Data Backups Are the Responsibility of Each User

Backing up and archiving data remains the sole responsibility of the end user. At this point in time shared computing does not offer these services in any automated way. We strongly encourage all users to take full advantage of any storage services to prevent accidental loss or deletion of critical data by contacting us for advice on best practices in data management. We welcome any suggestions for offering a higher level of data security as we move forward with shared computing at Rice.

 

Research Data Compliance

Due to recent changes in NSF, NIH, DOD, and other government granting agencies, Research Data Management has become an important area of growth for Rice and is a critical factor in both conducting and funding research. The onus of maintaining and preserving research data generated by funded research is placed squarely upon the research faculty, post docs, and graduate students conducting the research. It is imperative that you are aware of your compliance responsibilities so as not to jeopardize the ability of Rice University to receive federal funding. We will help to provide you the information and assistance you need, but the best place to start is the SPARC Research Compliance website.

Research Data Regulatory Controls and Restrictions

Compliance with regulatory controls and restrictions remains the sole responsibility of the end user. At this point in time shared computing does not offer these services on the clusters or research virtual machine service. We strongly encourage all users to take full advantage of available storage services and to contact us for advice on best practices in data management, to avoid accidentally putting data on a system that is not designed for it. We welcome any suggestions for offering a higher level of data security as we move forward with shared computing at Rice. The CRC does not have any kind of policing role, just an advisory one.

 

File Systems, Quotas, and Data Handling

To see your current quota and your disk usage for your home directory, run this command:

quota -s

To see the quota and usage for the $PROJECTS directories for all groups that you belong to, run this command:

quota -sg

To see the quota and usage for the $WORK directories on NOTS for the primary group to which you belong, run this command:

mmlsquota --block-size 1G -g $(id -gn)

 

The clustered file system $SHARED_SCRATCH provides fast, high-bandwidth I/O for running jobs. Though not limited by quotas, $SHARED_SCRATCH is intended for in-flight data being used as input and output for running jobs, and may be periodically cleaned through voluntary and involuntary means as use and abuse dictate.

Volatility of $SHARED_SCRATCH

The $SHARED_SCRATCH filesystem is designed for speed rather than data integrity and therefore may be subject to catastrophic data loss! It is designed for input and output files of running jobs, not persistent storage of data and software.

When moving data into $SHARED_SCRATCH, always copy it in with "cp" rather than "mv". A "cp" creates new files with a current access time, whereas "mv" preserves the original access time. This matters because our periodic cleaning mechanism purges files based on access time, so files brought in with "mv" may be purged sooner than expected.
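As a minimal sketch of the copy-in pattern described above, the snippet below stages an input file into a scratch directory with cp. The stage_demo directory and the sample input file are hypothetical names, and the fallback to a temporary directory is an assumption made only so the snippet also runs on a machine where $SHARED_SCRATCH is not set.

```shell
#!/bin/bash
# Stage input into scratch with cp so the copy gets a fresh access time;
# mv would carry over the old timestamps. Names below are illustrative.
scratch_root="${SHARED_SCRATCH:-$(mktemp -d)}"    # fallback for non-cluster machines
run_dir="$scratch_root/${USER:-demo}/stage_demo"  # hypothetical run directory
mkdir -p "$run_dir"

# Hypothetical input file standing in for real data.
input_file="$(mktemp)"
echo "sample input" > "$input_file"

cp "$input_file" "$run_dir/input.dat"
staged="$run_dir/input.dat"

# -u lists the last access time instead of the modification time.
ls -lu "$staged"
```

Running ls -lu on a file before and after staging is a quick way to confirm which timestamp the file carries.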

Avoid I/O over NFS

$HOME and $PROJECTS should not be used for job I/O. Jobs found to be using $HOME and $PROJECTS for job I/O are subject to termination without notice.

Use Variables Everywhere!

NOTE: The physical paths for the above file systems are subject to change. You should always access the filesystems using environment variables, especially in job scripts.
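One way to follow this advice in a script is to fail early when a filesystem variable is missing instead of falling back to a hard-coded path. The helper below is a sketch; require_var is a hypothetical function name, not something provided by the cluster.

```shell
#!/bin/bash
# Print the value of a named environment variable, or fail if it is unset/empty.
# require_var is a hypothetical helper, not a cluster-provided command.
require_var() {
    name="$1"
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "error: \$$name is not set; are you on the cluster?" >&2
        return 1
    fi
    printf '%s\n' "$value"
}

# Build a per-user scratch path from the variable, not from a literal path.
if scratch="$(require_var SHARED_SCRATCH)"; then
    echo "Job I/O directory: $scratch/${USER:-unknown}"
fi
```

Because the path is derived from the variable at run time, the script keeps working if the physical location of the filesystem changes.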

For information on how to use $PROJECTS, please see our FAQ.

Environment and Shells

The default shell on all the CRC clusters is bash. Other popular shells are available. To have your account's default shell changed from bash to one of these, please file a help request and specify the cluster, username, and desired shell in the ticket. Once your shell is changed, the change applies on all clusters to which you have access. Any login sessions that are active when your shell is changed must be terminated for the change to take effect.

Due to the nature of high performance applications and the batch scheduling system used on CRC clusters, managing your shell environment variables properly is vital.

Customizing Your Environment With the Module Command

Each user can customize their environment using the module command. This command lets you select software and will source the appropriate paths and libraries. All the requested user applications are located under the /opt/apps directory.

To list what applications are available, use the spider sub command:

$ module spider

--------------------------------------------------------------------

The following is a list of the modules currently available:

--------------------------------------------------------------------

  4ti2: 4ti2/1.6.9

    A software package for algebraic, geometric and combinatorial problems on linear spaces

  ACTC: ACTC/1.1

    ACTC converts independent triangles into triangle strips or fans.

  AFNI: AFNI/20150603

    Free software for analysis and display of FMRI data - Homepage: http://afni.nimh.nih.gov

--------------------------------------------------------------------

To learn more about a package execute:

   $ module spider Foo

where "Foo" is the name of a module.

To find detailed information about a particular package you

must specify the version if there is more than one version:

   $ module spider Foo/11.1

--------------------------------------------------------------------

To see a description of a specific package, use the spider sub command again:

$ module spider OpenMPI/4.1.1

-----------------------------------

  OpenMPI: OpenMPI/4.1.1

-----------------------------------

    Description:

      The Open MPI Project is an open source MPI-3 implementation.

    You will need to load all module(s) on any one of the lines below before the "OpenMPI/4.1.1" module is available to load.

      GCC/10.3.0

    Help:

      Description

      ===========

      The Open MPI Project is an open source MPI-3 implementation.      

      More information

      ================

       - Homepage: https://www.open-mpi.org/

To load the module for OpenMPI built with the GCC compilers, for example, use the load sub command:

$ module load GCC OpenMPI

To see a list of modules that you have loaded, use this command:

$ module list

To change to the Intel compiler build of OpenMPI use the swap sub command:

$ module swap GCC icc
  Due to MODULEPATH changes the following have been reloaded:
  1) OpenMPI/1.8.4

To unload all of your modules, use this command:

$ module purge

To make sure a set of modules are loaded automatically at login, use the module save sub command:

$ module load GCC OpenMPI
$ module save

Sometimes a module will not load unless its dependencies are loaded explicitly first. The following session shows this "error" and how to resolve it:

$ module load Boost

Lmod has detected the following error:  These module(s) exist but cannot be loaded as requested: "Boost"

   Try: "module spider Boost" to see how to load the module(s).

$ module spider Boost

-------------------------------------------

  Boost:

---------------------------------------------

    Description:

      Boost provides free peer-reviewed portable C++ source libraries.

     Versions:

        Boost/1.58.0

        Boost/1.61.0

        Boost/1.63.0

        Boost/1.64.0

        Boost/1.66.0

        Boost/1.67.0

        Boost/1.68.0

        Boost/1.69.0

        Boost/1.70.0

        Boost/1.71.0

        Boost/1.72.0

-----------------------------------------

  For detailed information about a specific "Boost" module (including how to load the modules) use the module's full name.

  For example:

     $ module spider Boost/1.72.0

----------------------------------------

$ module load Boost/1.72.0

Lmod has detected the following error:  These module(s) exist but cannot be loaded as requested: "Boost/1.72.0"

   Try: "module spider Boost/1.72.0" to see how to load the module(s).

$ module spider Boost/1.72.0

---------------------------------------

  Boost: Boost/1.72.0

---------------------------------------

    Description:

      Boost provides free peer-reviewed portable C++ source libraries.

    You will need to load all module(s) on any one of the lines below before the "Boost/1.72.0" module is available to load.

      GCC/9.3.0  CUDA/11.0.182  OpenMPI/4.0.3

      GCC/9.3.0  OpenMPI/4.0.3

      iccifort/2020.1.217  CUDA/11.0.182  OpenMPI/4.0.3

      iccifort/2020.1.217  impi/2019.7.217 

    Help:

      Description

      ===========

      Boost provides free peer-reviewed portable C++ source libraries. 

      More information

      ================

       - Homepage: https://www.boost.org/

$ module load GCC/9.3.0  OpenMPI/4.0.3 

$ module load Boost/1.72.0

QOSGrpCpuLimit

QOSGrpCpuLimit means that the limit on the number of CPUs allocated within a given QOS has been reached. By default everyone belongs to the nots_commons QOS; however, groups that have purchased condos have access to additional resources via different QOSes, which is why you may still see idle nodes in the cluster. When a QOS fills up, the remaining jobs submitted to the same QOS are held with the QOSGrpCpuLimit message until more CPUs are freed up.

QOSGrpMemoryLimit

QOSGrpMemoryLimit means that the limit on the amount of memory allocated within a given QOS has been reached. By default everyone belongs to the nots_commons QOS; however, groups that have purchased condos have access to additional resources via different QOSes, which is why you may still see idle nodes in the cluster. Your jobs will start running as soon as resources become available within your QOS.

Available Partitions and System Load

Account Name | Partition Name | Maximum CPUs Per Node | Maximum CPUs Per Job | Maximum jobs running per user | Maximum run time
commons | commons | 40 | 720 | 720 | 24:00:00
commons | interactive | 40 | 256 | 1 | 00:30:00
commons | scavenge | 40 | unlimited | unlimited | 04:00:00

The queues are defined as follows:

commons - the general shared computing pool. 

interactive - intended for short jobs, primarily for the purposes of debugging and running interactive jobs.  See our FAQ for information on interactive jobs.

scavenge - intended for jobs that run for 4 hours or less, taking advantage of idle condo resources and possibly reducing your wait time. 
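The run time limits above can be checked before submitting a job. The following sketch hard-codes the three limits from the partition table as minutes; the function names are hypothetical, and the values would need updating if the limits change.

```shell
#!/bin/bash
# Maximum run time in minutes for each partition, copied from the table above.
partition_max_minutes() {
    case "$1" in
        commons)     echo 1440 ;;  # 24:00:00
        interactive) echo 30 ;;    # 00:30:00
        scavenge)    echo 240 ;;   # 04:00:00
        *)           return 1 ;;   # unknown partition
    esac
}

# Succeeds if a requested run time (in minutes) fits within the partition limit.
fits_partition() {
    limit="$(partition_max_minutes "$1")" || return 2
    [ "$2" -le "$limit" ]
}

fits_partition scavenge 180 && echo "180 min fits in scavenge"
fits_partition interactive 45 || echo "45 min is too long for interactive"
```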

 

Use the following command to determine the partitions to which you have access. Note that the value in the Account column of the output must be provided to your batch script in addition to the partition information.

sacctmgr show assoc cluster=nots user=netID

Advanced usage: Use the following command to determine raw qos information for NOTS to see cluster characteristics. This information is presented in the above table to simplify the output.

sacctmgr show qos Names=nots_commons,nots_interactive,nots_scavenge

Determining Partition Status

An old way to obtain the status of all partitions and their current usage is to run the following SLURM command:

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST

Here is a brief description of the relevant fields:

PARTITION: Name of a partition. Note that the suffix "*" identifies the default partition.
AVAIL: Partition state: up or down.
TIMELIMIT: Maximum time limit for a user job in days-hours:minutes:seconds.
NODES: Count of nodes with this particular configuration by node state in the form "[A]llocated/[I]dle/[O]ther/[T]otal".
STATE: State of the nodes.
NODELIST: Names of nodes associated with this configuration/partition.

See the manpage for sinfo for more information

man sinfo

This is an old way of determining cluster status because of floating partitions and QOS usage. Although the above command can help you determine partition status, it does not provide a complete picture. The following command, used together with the one above, can help fill in the gaps. We acknowledge the complexity of the command's output and are working on a way to present this information in a more helpful way. For the time being, and for completeness, we provide this information.

sacctmgr show qos

Once you have an executable program and are ready to run it on the compute nodes, you must create a job script that performs the following functions:

  • Use job batch options to request the resources that will be needed (e.g. number of processors, run time, etc.), and
  • Use commands to prepare for execution of the executable (e.g. cd to the working directory, source shell environment files, copy input data to a scratch location, copy needed output off of the scratch location, clean up scratch files, etc.).

After the job script has been constructed you must submit it to the job scheduler for execution. The remainder of this section will describe the anatomy of a job script and how to submit and monitor jobs.

Please note script options are being provided using the long options and not the short options for readability and consistency e.g. --nodes versus -N.

 

 

SLURM Batch Script Options

All jobs must be submitted via a SLURM batch script or by invoking sbatch at the command line. See the table below for SLURM submission options.

Option | Description
#SBATCH --job-name=YourJobName | Recommended: Assigns a job name. The default is the name of the SLURM job script.
#SBATCH --partition=PartitionName | Recommended: Specifies the name of the partition (queue) to use. Use this to select the default partition or a special partition (i.e. a non-condo partition) to which you have access.
#SBATCH --ntasks=2 | Required: The maximum number of tasks per job. Usually used for MPI jobs.
#SBATCH --cpus-per-task=16 | Recommended: The number of CPU cores per task. Usually used for OpenMP or multi-threaded jobs.
#SBATCH --time=08:00:00 | Required: The maximum run time needed for this job, in days-hh:mm:ss.
#SBATCH --mem-per-cpu=1024M | Recommended: The maximum amount of physical memory used by any single process of the job, in ([M]ega|[G]iga|[T]era)bytes. The value of mem-per-cpu multiplied by the number of CPUs used on a node (mem-per-cpu X ntasks X cpus-per-task) should not exceed the amount of memory on a node. See our FAQ for more details.
#SBATCH --mail-user=netID@rice.edu | Recommended: Email address for job status messages. Replace netID with your NetID at Rice.
#SBATCH --mail-type=ALL | Recommended: SLURM will notify the user via email when the job reaches the following states: BEGIN, END, FAIL, or REQUEUE.
#SBATCH --nodes=1 --exclusive | Optional: Using both of these options gives your job exclusive access to a node, so that no other jobs will use the unallocated resources on the node. Please see our FAQ for more details on exclusive access.
#SBATCH --output=mypath | Optional: The full path for the standard output (stdout) and standard error (stderr) file. The default file name is "slurm-%j.out", where "%j" is replaced by the job ID, and the default location is the current working directory.
#SBATCH --error=mypath | Optional: The full path for the standard error (stderr) file. Use this only when you want to separate stderr from stdout. The current working directory is the default.
#SBATCH --export=ALL | Optional: Exports all environment variables to the job. See our FAQ for details.
#SBATCH --account=AccountName together with #SBATCH --partition=PartitionName | You need to specify the name of the condo account to use a condo on the cluster. Use the command sacctmgr show assoc user=netID to show the accounts and partitions to which you have access.
#SBATCH --constraint=<feature list> | Optional: Constrains the job to nodes matching a feature list. Currently available features include processor architectures (ivybridge, broadwell, skylake, cascadelake) and fabrics (opath). Features can be combined: --constraint="skylake&opath"
#SBATCH --gres=gpu:1 | Optional: Requests a number of GPUs per node.
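The memory rule attached to --mem-per-cpu can be worked through with shell arithmetic. This sketch uses the example values from the options above (1024 MB per CPU, 2 tasks, 16 CPUs per task) and assumes all tasks land on a single 192 GB node; the memory actually available to jobs on a node may be somewhat lower than its physical RAM.

```shell
#!/bin/bash
# Memory a job needs on one node: mem-per-cpu x ntasks x cpus-per-task.
# Assumes every task is placed on the same node.
mem_per_cpu_mb=1024
ntasks=2
cpus_per_task=16
node_mem_mb=$((192 * 1024))   # a 192 GB node, expressed in MB

total_mb=$((mem_per_cpu_mb * ntasks * cpus_per_task))
echo "Requested: ${total_mb} MB of ${node_mem_mb} MB"

if [ "$total_mb" -le "$node_mem_mb" ]; then
    echo "Request fits on one node"
else
    echo "Request exceeds node memory"
fi
```

With these values the request is 32768 MB against 196608 MB, so it fits.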

Serial Job Script

A job script may consist of SLURM directives, comments and executable statements. A SLURM directive provides a way of specifying job attributes in addition to the command line options. For example, we could create a myjob.slurm script this way:

myjob.slurm
#!/bin/bash
#SBATCH --job-name=YourJobNameHere
#SBATCH --account=commons
#SBATCH --partition=commons
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1000m
#SBATCH --time=00:30:00
#SBATCH --mail-user=netID@rice.edu
#SBATCH --mail-type=ALL

echo "My job ran on:"
echo $SLURM_NODELIST
if [[ -d $SHARED_SCRATCH/$USER && -w $SHARED_SCRATCH/$USER ]]; then
    cd $SHARED_SCRATCH/$USER
    srun /path/to/myprogram
fi

This example script will submit a job to the default partition using 1 processor and 1GB of memory per processor, with a maximum run time of 30 minutes.

Definition of --ntasks-per-node

For the clusters the  --ntasks-per-node  option means  tasks per node.

Accurate run time value is strongly recommended

It is important to specify an accurate run time for your job in your SLURM submission script.  Selecting eight hours for jobs that are known to run for much less time may result in the job being delayed by the scheduler due to an overestimation of the time the job needs to run.

How to specify mem

The --mem-per-cpu value represents memory per processor core.  If your --mem-per-cpu value multiplied by the number of tasks per node (--ntasks-per-node) exceeds the amount of memory per node, your job will not run.  If your job is going to use the entire node, then you should use the --exclusive option instead of the --mem-per-cpu or --ntasks-per-node options (see the --exclusive entry in the options table above).  It is good practice to specify the --mem-per-cpu option if you are going to be using less than an entire node and thus sharing the node with other jobs.

If you need to debug your program and want to run in interactive mode, the same request above could be constructed like this (via the srun command):

srun --pty --partition=interactive --ntasks=1 --mem=1G --time=00:30:00 $SHELL

For more details on interactive jobs, please see our FAQ on this topic.

SLURM Environment Variables in Job Scripts

When you submit a job, it will inherit several environment variables that are automatically set by SLURM. These environment variables can be useful in your job submission scripts as seen in the examples above. A summary of the most important variables is presented in the table below.

Variable Name | Description
$SHARED_SCRATCH | Location of shared scratch space.  See our FAQ for more details.
$LOCAL_SCRATCH | Location of local scratch space on each node.
$SLURM_JOB_NODELIST | Environment variable containing a list of all nodes assigned to the job.
$SLURM_SUBMIT_DIR | Path from where the job was submitted.
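The snippet below sketches how these variables might be read inside a job script. The fallback values after ":-" are assumptions added only so the snippet also runs outside a SLURM job, where the variables are unset.

```shell
#!/bin/bash
# Defaults let this run outside a SLURM job, where the variables are unset.
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
node_list="${SLURM_JOB_NODELIST:-$(uname -n)}"
scratch="${SHARED_SCRATCH:-/tmp}"

echo "Submitted from: $submit_dir"
echo "Running on:     $node_list"
echo "Scratch space:  $scratch"

# Return to the submission directory, for example to collect results there.
cd "$submit_dir" && echo "Now in: $PWD"
```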

Job Launcher (srun)

For jobs that need two or more processors and are compiled with MPI libraries, you must use srun to launch your job.  The job launcher's purpose is to spawn copies of your executable across the resources allocated to your job. We currently support srun for this task and do not support the mpirun or mpiexec launchers. By default srun only needs your executable; the rest of the information is extracted from SLURM.

The following is an example of how to use srun inside your SLURM batch script. This example will run myMPIprogram as a parallel MPI code on all of the processors allocated to your job by SLURM:

myMPIjob.slurm
#!/bin/bash
#SBATCH --job-name=YourJobNameHere
#SBATCH --account=commons
#SBATCH --partition=commons
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:30:00
#SBATCH --mail-user=netID@rice.edu
#SBATCH --mail-type=ALL

echo "My job ran on:"
echo $SLURM_NODELIST
if [[ -d $SHARED_SCRATCH/$USER && -w $SHARED_SCRATCH/$USER ]]; then
    cd $SHARED_SCRATCH/$USER
    srun /path/to/myMPIprogram
fi

This example script will submit a job to the default partition using 24 processor cores and 1GB of memory per processor core, with a maximum run time of 30 minutes.

Your Program must use MPI

The above example assumes that myMPIprogram is a program designed to be parallel (using MPI). If your program has not been parallelized then running on more than one processor will not improve performance and will result in wasted processor time and could result in multiple copies of your program being executed.

The following example will run myMPIprogram on only four processors even if your batch script requested more than four.

srun -n 4 /path/to/myMPIprogram

To ensure that your job can access an MPI runtime, you must load an MPI module before submitting your job, as follows:

module load GCC OpenMPI

Submitting and Monitoring Jobs

Once your job script is ready, use sbatch to submit it as follows:

sbatch /path/to/myjob.slurm

This will return a jobID number while the output and error stream of the job will be saved to one file inside the directory where the job was submitted, unless you specified otherwise.
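If a script needs the job ID, sbatch can print it in a machine-readable form with its --parsable option, which outputs the job ID optionally followed by a semicolon and the cluster name. The sketch below captures that ID; parse_jobid is a hypothetical helper, and the submission step is skipped on machines where sbatch is not installed.

```shell
#!/bin/bash
# sbatch --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
# parse_jobid is a hypothetical helper name.
parse_jobid() {
    printf '%s\n' "${1%%;*}"
}

# Submit only where sbatch exists; myjob.slurm is the script shown earlier.
if command -v sbatch >/dev/null 2>&1; then
    jobid="$(parse_jobid "$(sbatch --parsable myjob.slurm)")"
    echo "Submitted job $jobid"
    squeue -j "$jobid"
fi
```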

The status of the job can be obtained using SLURM commands.  See the table below for a list of commands:

Command | Description
squeue | Show a detailed list of all submitted jobs.
squeue -j jobID | Show a detailed description of the job given by jobID.
squeue --start -j jobID | Gives an estimate of the expected start time of the job given by jobID.

There are variations to these commands that can also be useful.  They are described below:

Command | Description
squeue -l | Show a list of all jobs in long format, which adds fields such as the time limit.
squeue -u username | Show a list of all jobs in queue owned by the user specified by username.
scontrol show job jobID | To get a verbose description of the job given by jobID. The output can be used as a template when you are attempting to modify a job.

A job can be in many different states after submission: BOOT_FAIL (BF), CANCELLED (CA), COMPLETED (CD), CONFIGURING (CF), COMPLETING (CG), FAILED (F), NODE_FAIL (NF), PENDING (PD), PREEMPTED (PR), RUNNING (R), SUSPENDED (S), TIMEOUT (TO), or SPECIAL_EXIT (SE). The squeue command with no arguments will list all jobs in their current state.  The most common states are described below.

Running (R): These are jobs that are running.

Pending (PD): These jobs are eligible to run but there are simply not enough resources to allocate to them at this time.
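When post-processing squeue output in a script, the compact state codes can be translated back to the names listed above. The function below is a sketch covering a subset of those states; job_state_name is a hypothetical helper.

```shell
#!/bin/bash
# Translate a compact squeue state code into its full name (subset only).
# job_state_name is a hypothetical helper name.
job_state_name() {
    case "$1" in
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        CG) echo "COMPLETING" ;;
        CD) echo "COMPLETED" ;;
        CA) echo "CANCELLED" ;;
        F)  echo "FAILED" ;;
        TO) echo "TIMEOUT" ;;
        NF) echo "NODE_FAIL" ;;
        *)  echo "UNKNOWN" ;;
    esac
}

job_state_name PD
job_state_name TO
```

On the cluster, the codes could come from a command such as squeue -u username -h -o "%i %t", which prints each job ID with its compact state.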

Deleting Jobs

A job can be deleted by using the scancel command as follows:

scancel jobID

Compiling and Optimizing

Several programming models are supported on this system. Programs that are sequential, parallel (within a node) or distributed can be run. Sequential programs require one processor to run. Parallel and distributed programs utilize multiple processors concurrently. Parallel programs are a subset of distributed programs. Generally speaking, distributed computing involves parametric sweeps, task farming, etc. Message-passing and threaded applications generally fit under the scope of parallel computing.  SPMD (single program, multiple data) is one of the most popular methods of parallelism, where each copy of a single executable works on its own data.

The supported compilers on this system are Intel and GCC. The MPI implementations from OpenMPI and Intel are available for both compilers and can be loaded upon demand using the module command. The preferred compiler for this system is Intel.

Compiling Serial Code

To compile serial code you must first load the appropriate compiler environment module.  To load the Intel compiler, execute this command:

module load iccifort

Once the environment is set, you can compile your program with one of the following (using Intel compiler as an example):

icc -o executablename sourcecode.c
icc -o executablename sourcecode.cc
ifort -o executablename sourcecode.f77 
ifort -o executablename sourcecode.f90

When invoked as described above, the compiler will perform the preprocessing, compilation, assembly and linking stages in a single step. The output file (or executable) is specified by executablename and the source code file is specified by sourcecode.f77, for example. Omitting the -o executablename option will result in the executable being named a.out by default. For additional instructions and advanced options please view the online manual pages for each compiler (e.g. execute the command man ifort).
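As a quick end-to-end check of the compile step, the sketch below writes a small C program and builds it with whichever compiler is on the path. The hello.c file name and its contents are illustrative, and the fallback from icc to gcc or cc is an assumption so the snippet also works where the Intel module is not loaded.

```shell
#!/bin/bash
# Write a small C program and compile it with whichever compiler is loaded.
workdir="$(mktemp -d)"
cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>

int main(void) {
    printf("hello from a serial program\n");
    return 0;
}
EOF

# Prefer icc if the Intel module is loaded; otherwise fall back to gcc or cc.
compiler="$(command -v icc || command -v gcc || command -v cc)"
if [ -n "$compiler" ]; then
    "$compiler" -o "$workdir/hello" "$workdir/hello.c" && "$workdir/hello"
else
    echo "no C compiler found; load a compiler module first" >&2
fi
```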

Compiling Parallel Code

To compile a parallel version of your code that has MPI library calls, use the appropriate MPI library. Again, use the module command to load the appropriate compiler environment as follows (Intel versions highly recommended):

module command | Description
module load GCC OpenMPI | For gcc compiled OpenMPI
module load iccifort OpenMPI | For Intel compiled OpenMPI
module load GCC impi | For gcc compiled Intel MPI
module load iccifort impi | For Intel compiled Intel MPI

To compile your code you will have to use the MPI compiler wrappers that are currently in your default path. The MPI wrappers are responsible for invoking the compiler, linking your program with the MPI library and setting the MPI include files.

Once the environment is set, you can compile your program with one of the following:

mpicc -o executablename mpi_sourcecode.c
mpicxx -o executablename mpi_sourcecode.cc
mpif77 -o executablename mpi_sourcecode.f77
mpif90 -o executablename mpi_sourcecode.f90

When invoked as described above, the compiler will perform the preprocessing, compilation, assembly and linking stages in a single step. The output file (or executable) is specified by executablename and the source code file is specified by mpi_sourcecode.f77, for example. Omitting the -o executablename option will result in the executable being named a.out by default. For additional instructions and advanced options please view the online manual pages for each compiler (e.g. execute the command man mpif77).

GNU Compiler

The GNU compiler is installed as part of the Red Hat Enterprise Linux distribution. Use man gcc to view the online manual for the C and C++ compiler, and man gfortran to view the online manual for the Fortran compiler.

Examples

There are various examples of job scripts and other helpful files which you can peruse on NOTS.

Log in to NOTS via SSH and look at the directories and files.

Examples
cd /opt/apps/examples
ls -al

Getting Help

Request Help with the Center for Research Computing Resources



Keywords: CRC Getting Started on NOTS mkl gres QOSGrpCpuLimit QOSGrpMemoryLimit QOSGrpMemLimit QOSlimit NSF Citation
Doc ID: 108237
Owner: Bryan R.
Group: Rice U
Created: 2021-01-11 11:34:54
Updated: 2024-08-22 14:23:07
Sites: Rice University