NVIDIA Corporation
NVIDIA DGX A100 System (AMD EPYC 7742 2.25 GHz, Tesla A100-SXM-80 GB)

SPEChpc 2021_sml_base = 13.6
SPEChpc 2021_sml_peak = 14.9
| hpc2021 License: | 019 | Test Date: | Sep-2022 |
|---|---|---|---|
| Test Sponsor: | NVIDIA Corporation | Hardware Availability: | Jul-2020 |
| Tested by: | NVIDIA Corporation | Software Availability: | Mar-2022 |
Results appear in the order in which they were run. Bold underlined text indicates a median measurement.

| Benchmark | Base | | | | | | | | | Peak | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Model | Ranks | Thrds/Rnk | Seconds | Ratio | Seconds | Ratio | Seconds | Ratio | Model | Ranks | Thrds/Rnk | Seconds | Ratio | Seconds | Ratio | Seconds | Ratio |
| 605.lbm_s | ACC | 16 | 16 | 79.2 | 19.6 | 79.1 | 19.6 | 79.2 | 19.6 | ACC | 16 | 16 | 74.1 | 20.9 | 74.2 | 20.9 | 74.1 | 20.9 |
| 613.soma_s | ACC | 16 | 16 | 92.9 | 17.2 | 93.2 | 17.2 | 94.2 | 17.0 | ACC | 8 | 32 | 56.6 | 28.3 | 56.7 | 28.2 | 57.0 | 28.1 |
| 618.tealeaf_s | ACC | 16 | 16 | 209 | 9.81 | 209 | 9.81 | 209 | 9.80 | ACC | 8 | 32 | 205 | 9.99 | 205 | 9.99 | 205 | 9.98 |
| 619.clvleaf_s | ACC | 16 | 16 | 137 | 12.0 | 137 | 12.1 | 137 | 12.0 | ACC | 32 | 8 | 132 | 12.5 | 132 | 12.5 | 132 | 12.5 |
| 621.miniswp_s | ACC | 16 | 16 | 49.0 | 22.5 | 49.1 | 22.4 | 49.1 | 22.4 | ACC | 16 | 16 | 49.0 | 22.5 | 49.1 | 22.4 | 49.1 | 22.4 |
| 628.pot3d_s | ACC | 16 | 16 | 153 | 11.0 | 150 | 11.2 | 150 | 11.2 | ACC | 16 | 16 | 148 | 11.3 | 148 | 11.3 | 148 | 11.3 |
| 632.sph_exa_s | ACC | 16 | 16 | 268 | 8.57 | 268 | 8.58 | 269 | 8.55 | ACC | 32 | 8 | 224 | 10.3 | 224 | 10.3 | 225 | 10.2 |
| 634.hpgmgfv_s | ACC | 16 | 16 | 155 | 6.27 | 156 | 6.27 | 155 | 6.27 | ACC | 32 | 8 | 149 | 6.52 | 149 | 6.53 | 150 | 6.52 |
| 635.weather_s | ACC | 16 | 16 | 89.1 | 29.2 | 89.1 | 29.2 | 89.1 | 29.2 | ACC | 16 | 16 | 89.1 | 29.2 | 89.1 | 29.2 | 89.1 | 29.2 |
| Hardware Summary | |
|---|---|
| Type of System: | SMP |
| Compute Node: | DGX A100 |
| Compute Nodes Used: | 1 |
| Total Chips: | 2 |
| Total Cores: | 128 |
| Total Threads: | 256 |
| Total Memory: | 2 TB |
| Software Summary | |
|---|---|
| Compiler: | C/C++/Fortran: Version 22.3 of NVIDIA HPC SDK for Linux |
| MPI Library: | OpenMPI Version 4.1.2rc4 |
| Other MPI Info: | HPC-X Software Toolkit Version 2.10 |
| Other Software: | None |
| Base Parallel Model: | ACC |
| Base Ranks Run: | 16 |
| Base Threads Run: | 16 |
| Peak Parallel Models: | ACC |
| Minimum Peak Ranks: | 8 |
| Maximum Peak Ranks: | 32 |
| Min. Peak Threads: | 8 |
| Max. Peak Threads: | 32 |
| Hardware | |
|---|---|
| Number of nodes: | 1 |
| Uses of the node: | compute |
| Vendor: | NVIDIA Corporation |
| Model: | NVIDIA DGX A100 System |
| CPU Name: | AMD EPYC 7742 |
| CPU(s) orderable: | 2 chips |
| Chips enabled: | 2 |
| Cores enabled: | 128 |
| Cores per chip: | 64 |
| Threads per core: | 2 |
| CPU Characteristics: | Turbo Boost up to 3400 MHz |
| CPU MHz: | 2250 |
| Primary Cache: | 32 KB I + 32 KB D on chip per core |
| Secondary Cache: | 512 KB I+D on chip per core |
| L3 Cache: | 256 MB I+D on chip per chip (16 MB shared / 4 cores) |
| Other Cache: | None |
| Memory: | 2 TB (32 x 64 GB 2Rx8 PC4-3200AA-R) |
| Disk Subsystem: | OS: 2 TB U.2 NVMe SSD; internal storage: 30 TB (8x 3.84 TB U.2 NVMe SSDs) |
| Other Hardware: | None |
| Accel Count: | 8 |
| Accel Model: | Tesla A100-SXM-80 GB |
| Accel Vendor: | NVIDIA Corporation |
| Accel Type: | GPU |
| Accel Connection: | NVLINK 3.0, NVSWITCH 2.0 600 GB/s |
| Accel ECC enabled: | Yes |
| Accel Description: | See Notes |
| Adapter: | NVIDIA ConnectX-6 MT28908 |
| Number of Adapters: | 8 |
| Slot Type: | PCIe Gen4 |
| Data Rate: | 200 Gb/s |
| Ports Used: | 1 |
| Interconnect Type: | InfiniBand / Communication |
| Adapter: | NVIDIA ConnectX-6 MT28908 |
| Number of Adapters: | 2 |
| Slot Type: | PCIe Gen4 |
| Data Rate: | 200 Gb/s |
| Ports Used: | 2 |
| Interconnect Type: | InfiniBand / FileSystem |
| Software | |
|---|---|
| Accelerator Driver: | NVIDIA UNIX x86_64 Kernel Module 470.103.01 |
| Adapter: | NVIDIA ConnectX-6 MT28908 |
| Adapter Driver: | InfiniBand: 5.4-3.4.0.0 |
| Adapter Firmware: | InfiniBand: 20.32.1010 |
| Adapter: | NVIDIA ConnectX-6 MT28908 |
| Adapter Driver: | Ethernet: 5.4-3.4.0.0 |
| Adapter Firmware: | Ethernet: 20.32.1010 |
| Operating System: | Ubuntu 20.04 (kernel 5.4.0-121-generic) |
| Local File System: | ext4 |
| Shared File System: | Lustre |
| System State: | Multi-user, run level 3 |
| Other Software: | None |
Binaries were built and run within an NVHPC SDK 22.3, CUDA 11.0, Ubuntu 20.04 container available from NVIDIA GPU Cloud (NGC):
https://ngc.nvidia.com/catalog/containers/nvidia:nvhpc
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc/tags
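For reference, such a container could be pulled from NGC roughly as follows; the tag shown is an assumed example, and current tag names should be taken from the NGC listing above:

```bash
# Illustrative only: the registry path is NGC's nvhpc repository, but the
# exact tag is an assumption; check the tag listing linked above.
docker pull nvcr.io/nvidia/nvhpc:22.3-devel-cuda11.0-ubuntu20.04
```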
The config file option 'submit' was used.
MPI startup command: the srun command was used to start MPI jobs.
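A minimal sketch of how the 'submit' option and srun might fit together in an hpc2021 config file, assuming the wrapper scripts described below (the exact srun options are not disclosed in this report):

```bash
# Hypothetical hpc2021 config-file submit line. $ranks and $command are
# standard config substitutions; the srun options and the wrapper choice
# here are assumptions.
submit = srun -n $ranks ./wrapper.GPU $command
```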
Individual ranks were bound to NUMA nodes, GPUs, and NICs using the following "wrapper.GPU" bash script for the case of one rank per GPU:
```bash
# Make libnuma visible to the dynamic linker inside the container.
ln -s -f libnuma.so.1 /usr/lib/x86_64-linux-gnu/libnuma.so
export LD_LIBRARY_PATH+=:/usr/lib/x86_64-linux-gnu
export LD_RUN_PATH+=:/usr/lib/x86_64-linux-gnu

# Resource lists are passed in via the NUMAS, GPUS, and NICS environment
# variables, one entry per local rank.
declare -a NUMA_LIST
declare -a GPU_LIST
declare -a NIC_LIST
NUMA_LIST=($NUMAS)
GPU_LIST=($GPUS)
NIC_LIST=($NICS)

# Bind this rank (indexed by SLURM_LOCALID) to its own NIC, GPU, and NUMA
# node, then run the benchmark with local memory allocation.
export UCX_NET_DEVICES=${NIC_LIST[$SLURM_LOCALID]}:1
export OMPI_MCA_btl_openib_if_include=${NIC_LIST[$SLURM_LOCALID]}
export CUDA_VISIBLE_DEVICES=${GPU_LIST[$SLURM_LOCALID]}
numactl -l -N ${NUMA_LIST[$SLURM_LOCALID]} $*
```
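For illustration, the NUMAS, GPUS, and NICS lists might be exported as follows before the run; these placeholder values are assumptions, not the NUMA/GPU/NIC pairing used in the tested run:

```bash
# Illustrative placeholders: one entry per local rank (8 ranks, one per GPU).
# The actual DGX A100 mapping used for this result is not shown in the report.
export NUMAS="0 1 2 3 4 5 6 7"
export GPUS="0 1 2 3 4 5 6 7"
export NICS="mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7"
```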
and the following "wrapper.MPS" bash script for the oversubscribed case (more than one rank per GPU):
```bash
# Make libnuma visible to the dynamic linker inside the container.
ln -s -f libnuma.so.1 /usr/lib/x86_64-linux-gnu/libnuma.so
export LD_LIBRARY_PATH+=:/usr/lib/x86_64-linux-gnu
export LD_RUN_PATH+=:/usr/lib/x86_64-linux-gnu

# Resource lists are passed in via the NUMAS, GPUS, and NICS environment
# variables, one entry per GPU.
declare -a NUMA_LIST
declare -a GPU_LIST
declare -a NIC_LIST
NUMA_LIST=($NUMAS)
GPU_LIST=($GPUS)
NIC_LIST=($NICS)

# Map this rank to a GPU: consecutive blocks of RANKS_PER_GPU local ranks
# share one GPU.
NUM_GPUS=${#GPU_LIST[@]}
RANKS_PER_GPU=$((SLURM_NTASKS_PER_NODE / NUM_GPUS))
GPU_LOCAL_RANK=$((SLURM_LOCALID / RANKS_PER_GPU))
export UCX_NET_DEVICES=${NIC_LIST[$GPU_LOCAL_RANK]}:1
export OMPI_MCA_btl_openib_if_include=${NIC_LIST[$GPU_LOCAL_RANK]}

# Start the MPS control daemon; every rank attempts this, so failures from
# all ranks after the first are tolerated.
set +e
nvidia-cuda-mps-control -d 1>&2
set -e

# Bind to the shared GPU and its NUMA node, then run the benchmark.
export CUDA_VISIBLE_DEVICES=${GPU_LIST[$GPU_LOCAL_RANK]}
numactl -l -N ${NUMA_LIST[$GPU_LOCAL_RANK]} $*

# Local rank 0 shuts the MPS daemon down after its benchmark run finishes.
if [ $SLURM_LOCALID -eq 0 ]
then
    echo 'quit' | nvidia-cuda-mps-control 1>&2
fi
```
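For example, in the base runs (16 ranks per node on the 8 GPUs listed below), RANKS_PER_GPU evaluates to 16 / 8 = 2, so local ranks 0 and 1 compute GPU_LOCAL_RANK = 0 and share GPU 0 under MPS, ranks 2 and 3 share GPU 1, and so on.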
Full system details are documented here: https://images.nvidia.com/aem-dam/Solutions/Data-Center/gated-resources/nvidia-dgx-superpod-a100.pdf

Environment variables set by runhpc before the start of the run:
SPEC_NO_RUNDIR_DEL = "on"
| Detailed A100 Information from nvaccelinfo | |
|---|---|
| CUDA Driver Version: | 11040 |
| NVRM version: | NVIDIA UNIX x86_64 Kernel Module 470.7.01 |
| Device Number: | 0 |
| Device Name: | NVIDIA A100-SXM-80 GB |
| Device Revision Number: | 8.0 |
| Global Memory Size: | 85198045184 |
| Number of Multiprocessors: | 108 |
| Concurrent Copy and Execution: | Yes |
| Total Constant Memory: | 65536 |
| Total Shared Memory per Block: | 49152 |
| Registers per Block: | 65536 |
| Warp Size: | 32 |
| Maximum Threads per Block: | 1024 |
| Maximum Block Dimensions: | 1024, 1024, 64 |
| Maximum Grid Dimensions: | 2147483647 x 65535 x 65535 |
| Maximum Memory Pitch: | 2147483647B |
| Texture Alignment: | 512B |
| Clock Rate: | 1410 MHz |
| Execution Timeout: | No |
| Integrated Device: | No |
| Can Map Host Memory: | Yes |
| Compute Mode: | default |
| Concurrent Kernels: | Yes |
| ECC Enabled: | Yes |
| Memory Clock Rate: | 1593 MHz |
| Memory Bus Width: | 5120 bits |
| L2 Cache Size: | 41943040 bytes |
| Max Threads Per SMP: | 2048 |
| Async Engines: | 3 |
| Unified Addressing: | Yes |
| Managed Memory: | Yes |
| Concurrent Managed Memory: | Yes |
| Preemption Supported: | Yes |
| Cooperative Launch: | Yes |
| Multi-Device: | Yes |
| Default Target: | cc80 |
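The listing above can be regenerated with the nvaccelinfo utility, which ships with the NVHPC SDK:

```bash
# Prints the CUDA device properties summarized in the table above.
nvaccelinfo
```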
==============================================================================
CC 605.lbm_s(base, peak) 613.soma_s(base, peak) 618.tealeaf_s(base, peak)
621.miniswp_s(base, peak) 634.hpgmgfv_s(base, peak)
------------------------------------------------------------------------------
nvc 22.3-0 64-bit target on x86-64 Linux -tp zen2-64
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
------------------------------------------------------------------------------
==============================================================================
CXXC 632.sph_exa_s(base, peak)
------------------------------------------------------------------------------
nvc++ 22.3-0 64-bit target on x86-64 Linux -tp zen2-64
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
------------------------------------------------------------------------------
==============================================================================
FC 619.clvleaf_s(base, peak) 628.pot3d_s(base, peak) 635.weather_s(base, peak)
------------------------------------------------------------------------------
nvfortran 22.3-0 64-bit target on x86-64 Linux -tp zen2-64
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
------------------------------------------------------------------------------
| Base Portability Flags | |
|---|---|
| 605.lbm_s: | -DSPEC_OPENACC_NO_SELF |
| 632.sph_exa_s: | --c++17 |

| Base Optimization Flags | |
|---|---|
| C benchmarks: | -fast -DSPEC_ACCEL_AWARE_MPI -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 |
| C++ benchmarks: | -fast -DSPEC_ACCEL_AWARE_MPI -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 |
| Fortran benchmarks: | -DSPEC_ACCEL_AWARE_MPI -fast -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 |

| Base Other Flags | |
|---|---|
| C benchmarks: | -Ispecmpitime -w |
| C++ benchmarks: | -Ispecmpitime -w |
| Fortran benchmarks: | -w |
| 619.clvleaf_s: | -Ispecmpitime -w |

| Peak Portability Flags | |
|---|---|
| 605.lbm_s: | -DSPEC_OPENACC_NO_SELF |
| 632.sph_exa_s: | --c++17 |

| Peak Optimization Flags | |
|---|---|
| 605.lbm_s: | -O3 -DSPEC_ACCEL_AWARE_MPI -acc=gpu -gpu=cuda11.0 -gpu=cc80 -gpu=maxregcount:128 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 -mp |
| 613.soma_s: | -fast -DSPEC_ACCEL_AWARE_MPI -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 |
| 618.tealeaf_s: | -O3 -DSPEC_ACCEL_AWARE_MPI -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 -mp -Msafeptr |
| 621.miniswp_s: | basepeak = yes |
| 634.hpgmgfv_s: | -fast -DSPEC_ACCEL_AWARE_MPI -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 -Msafeptr |
| 632.sph_exa_s: | -fast -DSPEC_ACCEL_AWARE_MPI -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 -Mquad |
| 619.clvleaf_s: | -DSPEC_ACCEL_AWARE_MPI -fast -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 -mp |
| 628.pot3d_s: | -DSPEC_ACCEL_AWARE_MPI -fast -acc=gpu -gpu=cuda11.0 -gpu=cc80 -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 |
| 635.weather_s: | basepeak = yes |
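As a hedged illustration of how these flag groups combine into a single command, a base Fortran compile might look roughly like the following; the mpif90 wrapper and the source/output names are assumptions, not taken from this report:

```bash
# Hypothetical compile line assembled from the base Fortran optimization
# flags above; the mpif90 wrapper and file names are illustrative only.
mpif90 -DSPEC_ACCEL_AWARE_MPI -fast -acc=gpu -gpu=cuda11.0 -gpu=cc80 \
       -Mstack_arrays -Mfprelaxed -Mnouniform -tp=zen2 \
       -o clvleaf_base clvleaf.F90
```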