CRC RANGE Description
RANGE (Rice AI Networked GPU Engine) is the Center for Research Computing's next-generation NVIDIA GPU cluster, designed to accelerate AI and data-driven research across all disciplines. RANGE combines cutting-edge NVIDIA GPUs with ultra-fast NVIDIA NDR networking to provide both increased capacity and capability for Rice researchers. RANGE has access to the CRC's VAST high-performance data share and to the RHF (Research High-Capacity Facility) for storing large datasets. RANGE uses the SLURM batch scheduler and LMOD for managing software environments, and provides preinstalled suites of many commonly used AI/ML software tools.
Citation
If you use RANGE to support your research activities, please acknowledge (in publications, on your project web pages, …) our efforts to maintain this infrastructure for your use. An example acknowledgment follows; feel free to modify the wording for your specific needs, but please keep the essential information:
This work was supported in part by the RANGE cluster operated by Rice University's Center for Research Computing (CRC).
Node Configurations
RANGE has four nodes designed to provide capacity for educational use and for the courses, workshops, and trainings that increasingly include AI workloads. Examples of capacity workloads include tuning, repurposing, and using existing foundation models in the 10-billion-parameter range.
| Component | Specification |
| --- | --- |
| Hardware | Dell PowerEdge R760xa |
| GPU | 4-way H100 NVL (4x 94 GB) |
| CPU | Dual Intel Xeon Gold 6548Y+, 2.5 GHz, 32C/64T |
| RAM | 16x 32 GB RDIMMs, 5600 MT/s (512 GB per node) |
| Network | 1 Mellanox ConnectX-6 DX dual-port 100 GbE card |
| RDMA/GPU Direct Storage | 1 NVIDIA ConnectX-7 single-port NDR200 OSFP PCIe card |
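As a minimal sketch of how a capacity node might be requested through SLURM (the partition name `capacity` and the script name `finetune.py` are assumptions for illustration; run `sinfo` on the cluster for the actual partition names):

```bash
#!/bin/bash
#SBATCH --job-name=finetune-10b
#SBATCH --partition=capacity     # assumed partition name; run sinfo for the real list
#SBATCH --nodes=1
#SBATCH --gres=gpu:4             # all four H100 NVL GPUs on the node
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=04:00:00

module load CUDA                 # pick a specific version via module spider CUDA

nvidia-smi                       # confirm the GPUs are visible before the real work
python finetune.py               # placeholder for your training script
```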
RANGE has eight nodes designed to support the growing size and increasing computational complexity of modern ML models. These capability systems function both as production resources for campus research groups and as prototyping environments for researchers interested in the national centers and their AI computing efforts. A practical example of the scale of problems requiring capability systems is the training of the BLOOM large language model (https://arxiv.org/abs/2211.05100).
| Component | Specification |
| --- | --- |
| Hardware | Dell PowerEdge XE9680 |
| GPU | 8-way H200 SXM (8x 141 GB) |
| CPU | Dual Intel Xeon Platinum 8470, 2.0 GHz, 52C/104T |
| RAM | 32x 64 GB RDIMMs, 5600 MT/s (2 TB per node) |
| Network | 1 Mellanox ConnectX-6 DX dual-port 100 GbE card |
| RDMA network | 8 NVIDIA ConnectX-7 400 Gbps NDR OSFP PCIe cards (one dedicated per GPU) |
| GPU Direct Storage | 1 NVIDIA ConnectX-7 single-port NDR200 OSFP PCIe card |
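A sketch of a multi-node capability job follows; the partition name `capability`, the node count, and the script name `train.py` are assumptions for illustration, not RANGE-specific defaults:

```bash
#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --partition=capability   # assumed partition name; run sinfo for the real list
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8      # one task per GPU
#SBATCH --gres=gpu:8             # all eight H200 SXM GPUs on each node
#SBATCH --time=24:00:00

module load CUDA                 # pick a specific version via module spider CUDA

srun python train.py             # srun launches one task per GPU across both nodes
```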
Software/OS Environment
- Operating system: Red Hat Enterprise Linux 9.5
- SLURM scheduler: version 24.11
- CUDA libraries: run `module spider CUDA` on the cluster to see the available versions, as shown below
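A typical session for locating and loading a CUDA toolkit with LMOD might look like the following; the version `CUDA/12.4` is a placeholder, so substitute one actually reported by `module spider`:

```bash
# List the CUDA versions installed on RANGE
module spider CUDA

# Load a specific version (12.4 is a placeholder; use a version reported above)
module load CUDA/12.4

# Confirm the toolkit is on your PATH
nvcc --version
```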
