DGX-1 Hyperscale-class GPU Computing

The Nvidia DGX-1 is a new HPC system (not just a server) built around Tesla P100 accelerators for GPU computing. It pairs 2x Intel Xeon E5-2698 v3 (16 core, Haswell-EP) with 8 P100s, for 28,672 CUDA cores (8 x 3,584 per GP100) and 128GB of total VRAM. The DGX-1 is rated to hit 170 FP16 TFLOPs of performance (8 x ~21.2 TFLOPs per P100, or 85 FP32 TFLOPs) in a 3U chassis.

The P100 has a new form factor and connector that require a completely new infrastructure to run. The 8 P100s are installed in a hybrid mesh cube configuration, making full use of the NVLink interconnect to offer significant bandwidth between the GPUs. Each NVLink provides 20GB/sec in each direction, and with 4 links per GP100 GPU the aggregate bandwidth is 80GB/sec up and another 80GB/sec down.
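
To give a feel for how software exploits this topology, here is a minimal sketch of a CUDA peer-to-peer copy, the standard runtime mechanism for moving data directly between GPUs; on NVLink-connected P100s these transfers ride the links described above rather than bouncing through host memory. The device IDs and buffer size are illustrative assumptions.

    // Minimal sketch: direct GPU-to-GPU copy via CUDA peer access.
    // Assumes devices 0 and 1 are NVLink-connected P100s.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int canAccess = 0;
        // Check whether GPU 0 can directly address GPU 1's memory.
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        if (!canAccess) { printf("No P2P between GPU 0 and 1\n"); return 1; }

        const size_t bytes = 64 << 20;  // 64 MB test buffer (illustrative)
        float *src, *dst;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0 reach GPU 1 directly
        cudaMalloc(&src, bytes);

        cudaSetDevice(1);
        cudaMalloc(&dst, bytes);

        // With peer access enabled, this copy travels GPU-to-GPU over
        // the interconnect instead of staging through system memory.
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        return 0;
    }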

The DGX-1 system runs Canonical’s Ubuntu Server with Nvidia’s drivers for the Pascal GPUs. Note that most hyperscalers deploying large CPU / GPU clusters to train their neural networks use Ubuntu. The system also bundles Nvidia’s Deep Learning SDK, its DIGITS GPU training system, the CUDA programming environment, and a set of machine learning frameworks tuned for the Pascal GPUs. Nvidia invested heavily in NVLink, its high-speed interconnect, to enable fast memory access between GPUs and unified memory between the GPU and CPU.
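
As a concrete illustration of that unified memory model, here is a minimal CUDA sketch using cudaMallocManaged, the runtime's managed-allocation call, so one pointer is valid on both CPU and GPU; the kernel name and sizes are illustrative, not taken from Nvidia's SDK.

    // Minimal sketch: CUDA Unified Memory, which Pascal extends
    // with hardware page faulting and on-demand migration.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *data;

        // One allocation visible to both CPU and GPU; pages migrate
        // between host and device memory as they are touched.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;

        scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]);  // CPU reads the result directly
        cudaFree(data);
        return 0;
    }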

The downside is that you are locked into the Intel / Nvidia combo for loosely integrated, and therefore less efficient, x86 CPU / GPU computing (Nvidia doesn't have an x86 license). The lack of competition in this space is disconcerting.

The upside is that Intel has fast x86 CPUs and fast storage in Optane, and Nvidia has an accessible programming language in CUDA.

The DGX-1 delivers high performance for deep learning and neural network applications. Features include:

  • 2x Intel Xeon E5-2698 v3 (16 core, Haswell-EP)
  • 8 P100s for 28,672 CUDA cores and 128GB of shared VRAM
  • High speed, high bandwidth interconnect for maximum application scalability
  • HBM2 - Fast, high capacity, extremely efficient stacked GPU memory architecture
  • Unified Memory and Compute Preemption - significantly improved programming model (see the device-query sketch after this list)
  • 16nm FinFET - enables more features and improved power efficiency
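
For reference, here is a minimal sketch that checks several of the properties listed above at runtime with the standard CUDA device-query APIs; the 64-cores-per-SM multiplier is the published GP100 figure, hard-coded as an assumption because the runtime does not report CUDA cores directly.

    // Minimal sketch: enumerate GPUs and report SM count, memory
    // capacity, and compute preemption support (CUDA 8+ attribute).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);

            int preempt = 0;
            cudaDeviceGetAttribute(&preempt,
                cudaDevAttrComputePreemptionSupported, d);

            // 64 FP32 cores per SM is the GP100 figure (an assumption
            // here; other architectures differ).
            printf("GPU %d: %s, %d SMs (~%d CUDA cores), %.0f GB, preemption=%d\n",
                   d, p.name, p.multiProcessorCount,
                   p.multiProcessorCount * 64,
                   p.totalGlobalMem / 1073741824.0, preempt);
        }
        return 0;
    }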