Micron Technology Inc.

11/28/2023 | Press release | Distributed by Public on 11/28/2023 12:32

WEKA storage with the Micron 6500 ION SSD supports 256 AI accelerators

Micron recently published results for MLPerf Storage v0.5 on the Micron® 9400 NVMe™ SSD. Those results highlight the high-performance NVMe SSD as a local cache in an AI server, a use case where the Micron 9400 NVMe SSD performs extremely well. However, most AI training data lives not in local cache but on shared storage. For SC23, we decided to test the same MLPerf Storage AI workload on a WEKA storage cluster powered by the 30TB Micron 6500 ION NVMe SSD.

WEKA is a distributed, parallel filesystem designed for AI workloads, and we wanted to know how the MLPerf Storage AI workload scales on a high-performance SDS solution. The results are enlightening, helping us make sizing recommendations for current-generation AI systems and hinting at the massive throughput future AI storage systems will require.

First, a quick refresher on MLPerf Storage
MLCommons maintains and develops six different benchmark suites and is developing open datasets to support future state-of-the-art model development. The MLPerf Storage Benchmark Suite is the latest addition to the MLCommons' benchmark collection.

MLPerf Storage sets out to address two key challenges in characterizing the storage workload for AI training systems: the cost of AI accelerators and the small size of available datasets.

For a deeper dive into the workload generated by MLPerf Storage and a discussion of the benchmark, see our previous blog posts.
Next, let's go over the WEKA cluster under test
My teammate, Sujit, wrote a post earlier this year describing the performance of the cluster in synthetic workloads. See that post for the full results.

The cluster is made up of six storage nodes. In aggregate, the cluster provides 838TB of capacity and, for high queue-depth workloads, achieves 200 GB/s of throughput.

Finally, let's review how this cluster performs in MLPerf Storage
Quick note: The results presented here are unvalidated as they have not been submitted to MLPerf Storage for review. Also, the MLPerf Storage benchmark is undergoing changes from v0.5 to the next version for the first 2024 release. The numbers presented here use the same methodology as the v0.5 release (independent datasets for each client, independent clients, and accelerators in a client share a barrier).

The MLPerf Storage benchmark emulates NVIDIA® V100 accelerators in the 0.5 version. The NVIDIA DGX-2 server has 16 V100 accelerators. For this testing, we show the number of clients supported on the WEKA cluster where each client emulates 16 V100 accelerators, like in the NVIDIA DGX-2.

Additionally, v0.5 of the MLPerf Storage benchmark implements two different models, Unet3D and BERT. Through testing, we find that BERT does not generate significant storage traffic, so we're going to focus on Unet3D for the testing here. (Unet3D is a 3D medical imaging model.)

This plot shows the total throughput to the storage system for a given number of client nodes. Remember, each node has 16 emulated accelerators. Furthermore, to be considered a "success," a given configuration of nodes and accelerators must maintain greater than 90% accelerator utilization. Utilization below 90% represents idle time on the accelerators as they wait for data.
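The pass/fail rule above can be sketched in a few lines. This is a minimal illustration of the 90% utilization criterion, not the benchmark's actual implementation; the function name and time-based inputs are hypothetical.

```python
# Sketch of the MLPerf Storage success criterion described above:
# a configuration "succeeds" only if the emulated accelerators stay
# above 90% utilized, i.e., spend little time waiting on storage.
UTILIZATION_FLOOR = 0.90

def run_succeeds(compute_time_s: float, io_wait_time_s: float) -> bool:
    """Accelerator utilization = compute time / total wall time."""
    utilization = compute_time_s / (compute_time_s + io_wait_time_s)
    return utilization > UTILIZATION_FLOOR

print(run_succeeds(95.0, 5.0))   # 95% utilized: storage keeps up
print(run_succeeds(85.0, 15.0))  # 85% utilized: storage is the bottleneck
```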

Here we see that the six-node WEKA storage cluster supports 16 clients, each emulating 16 accelerators, for a total of 256 emulated accelerators, while sustaining 91 GB/s of throughput.

This is equivalent to supporting 16 NVIDIA DGX-2 systems (with 16 V100 GPUs each), a remarkably high number of AI systems for a six-node WEKA cluster.

The V100 is a PCIe Gen3 GPU, and the performance gains across NVIDIA GPU generations are far outpacing advances in platform and PCIe generations. In single-node testing, we find that an emulated NVIDIA A100 GPU consumes data four times faster than a V100 on this workload.

With a maximum of 91 GB/s of throughput, we can estimate that this WEKA deployment would support eight DGX A100 systems (with 8 A100 GPUs each).
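The arithmetic behind that estimate follows directly from the numbers above: 91 GB/s divided across 256 emulated V100s gives the per-V100 demand, an A100 consumes roughly four times that, and a DGX A100 holds eight GPUs. A quick sketch using only figures from this post:

```python
# Estimate how many DGX A100 systems the measured WEKA throughput could feed.
# All inputs come from the results above; the 4x A100-vs-V100 factor is the
# single-node measurement for this workload.
measured_throughput_gbps = 91.0   # GB/s sustained by the six-node cluster
emulated_v100s = 16 * 16          # 16 clients x 16 emulated V100s = 256

per_v100_gbps = measured_throughput_gbps / emulated_v100s  # ~0.36 GB/s each
per_a100_gbps = per_v100_gbps * 4                          # A100 consumes ~4x
gpus_per_dgx_a100 = 8

supported_a100s = measured_throughput_gbps / per_a100_gbps
supported_dgx_a100 = supported_a100s / gpus_per_dgx_a100

print(int(supported_a100s), int(supported_dgx_a100))  # 64 A100s, 8 systems
```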

Looking further into the future at H100 / H200 (PCIe Gen5) and X100 (PCIe Gen6), cutting-edge AI training servers are going to push a massive amount of throughput.

For today, WEKA storage and the Micron 6500 NVMe SSD are the perfect combination of capacity, performance and scalability for your AI workloads.

Stay tuned as we continue to explore storage for AI!

Wes Vaske

Wes Vaske is a Senior Member of Technical Staff on the Micron Data Center Workloads Engineering team in Austin, Texas. He analyzes enterprise workloads to understand the performance effects of flash and DRAM devices on applications and provides real-life workload characterization to internal design and development teams. Wes's specific focus is artificial intelligence applications and developing tools for tracing and system observation.