
What Is AI Infrastructure?

AI infrastructure is the technology stack that runs AI workloads. An AI technology stack consists of:

  • High Performance Computing (HPC) hardware and networking components
  • The platform layer
  • Data workloads
  • ML models

AI technologies are highly resource intensive, so organizations typically rely on bespoke infrastructure to maximize the compute efficiency, reliability and scalability of their AI systems.

(Related reading: infrastructure security & Splunk Infrastructure Monitoring.)

Components in AI infrastructure

Let's review the key components of an artificial intelligence infrastructure. (New to IT? Start with this IT infrastructure beginner's guide.)

Compute infrastructure

For AI developers, the most relevant AI infrastructure component is the specialized hardware used to train and run AI models. A GPU architecture contains:

  • Parallel processing cores and threads
  • High memory bandwidth
  • Optimized memory hierarchy
  • Specialized processing units such as Tensor Cores to accelerate parallel matrix multiplication operations as part of model training and inference
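To make the GPU's role concrete, here is a minimal sketch using PyTorch (a framework this article covers later). The matrix sizes are arbitrary assumptions, and the snippet falls back to the CPU when no CUDA device is present:

```python
# A minimal sketch of GPU-accelerated matrix multiplication with PyTorch.
# Illustrative only: matrix sizes are arbitrary, and the code falls back
# to the CPU when no CUDA-capable GPU is available.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Half-precision matmuls like this are routed to Tensor Cores on
# supporting NVIDIA hardware, exercising thousands of cores in parallel.
a = torch.randn(4096, 4096, device=device, dtype=dtype)
b = torch.randn(4096, 4096, device=device, dtype=dtype)

c = a @ b
print(c.shape)  # torch.Size([4096, 4096])
```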

HPC CPUs are more commonly used for standardized, often latency-sensitive computing tasks such as:

  • Data loading and management
  • I/O operations
  • Debugging and development
  • Model deployment
  • Execution
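A minimal sketch of this division of labor, assuming PyTorch: CPU worker processes handle the data loading and I/O path while the accelerator handles model execution. The dataset shapes and worker count are arbitrary assumptions:

```python
# A minimal sketch of the CPU's role: worker processes handle data
# loading and I/O so the accelerator stays fed. Shapes are arbitrary.
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(10_000, 128),
                            torch.randint(0, 2, (10_000,)))
    # num_workers > 0 spawns CPU worker processes for the
    # latency-sensitive loading path described above.
    loader = DataLoader(dataset, batch_size=256, num_workers=4)
    for features, labels in loader:
        pass  # the model's training step (typically on a GPU) runs here

if __name__ == "__main__":  # guard required for multiprocessing workers
    main()
```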

(CPUs vs. GPUs: when to use each.)

Storage infrastructure

AI model performance is highly dependent on the data used to train it. In fact, the success of LLMs such as ChatGPT largely comes down to their training data.

While data may be free and publicly available, it takes an efficient storage infrastructure and data platform to ingest, process, analyze and train AI models on large volumes of information at scale.

Key considerations associated with the AI storage infrastructure include scalability (with regards to cost of storage), I/O performance, security and compliance.
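As an illustration of the I/O-performance consideration, here is a minimal sketch of throughput-oriented ingestion in Python: data is streamed in fixed-size chunks so memory never becomes the bottleneck. The chunk size and the temporary file standing in for a training shard are assumptions for the demo:

```python
# A minimal sketch of throughput-oriented ingestion: stream data in
# large chunks instead of loading whole files, so I/O - not RAM -
# becomes the limiting factor.
import os
import tempfile

def ingest(path, chunk_size=64 * 1024 * 1024):
    """Yield fixed-size chunks of a file (64 MB by default)."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# Demo with a small temporary file standing in for a training shard.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024))
total = sum(len(c) for c in ingest(tmp.name, chunk_size=256))
print(total)  # 1024
os.remove(tmp.name)
```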

Networking infrastructure

AI workloads require high-performance network fabrics that can handle trillions of AI model executions and compute processes across distributed hardware clusters. The network must be able to load-balance elephant-flow data workloads, especially when the network architecture follows hierarchical patterns for efficient data handling.

The performance impact at the physical layer should be minimal: high I/O in real-time data stream processing can lead to packet loss.
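A back-of-the-envelope sketch shows why these fabrics matter. Assuming (hypothetically) a 70B-parameter model with fp16 gradients and a 400 Gbps endpoint, even a naive full-gradient transfer takes seconds:

```python
# A back-of-the-envelope sketch: time to move one full set of fp16
# gradients over a single endpoint. All numbers are illustrative
# assumptions (model size, precision, link speed).
params = 70e9          # assumed 70B-parameter model
bytes_per_param = 2    # fp16 gradients
link_gbps = 400        # assumed 400 Gbps endpoint speed

seconds = params * bytes_per_param * 8 / (link_gbps * 1e9)
print(f"~{seconds:.1f} s per naive full-gradient transfer")  # ~2.8 s
```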

Platform & application layers

The platform and software/application stack provides resources specific to AI development and model deployment.

ML frameworks such as PyTorch, GPU programming toolkits such as CUDA, and other model-specific frameworks speed up the AI development process. These software tools are typically provisioned as containerized systems that isolate AI development from the underlying hardware infrastructure.
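As a minimal sketch, a containerized training job might start by confirming that the framework, the CUDA toolkit and the GPU are all visible; the exact checks shown here are illustrative:

```python
# A minimal sketch: from inside a containerized environment, confirm
# the framework, CUDA toolkit and GPU are visible before training.
import torch

print(torch.__version__)          # ML framework version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True when the container maps a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. an NVIDIA H100
```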

Finally, MLOps is adopted to automate the management of:

  • The AI infrastructure and platform
  • Tooling delivery
  • Other infrastructure operations such as resource provisioning, risk management and platform architecture design

Monitoring and optimization run at the infrastructure layer, using AI-driven monitoring and analytics tools that analyze traffic from the distributed AI infrastructure, including cloud-based and on-premises systems.
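Here is a minimal sketch of what such monitoring does at its simplest: flag metric samples that deviate sharply from a rolling baseline. The metric stream is simulated; real tools ingest telemetry from the clusters themselves:

```python
# A minimal sketch of infrastructure-layer monitoring: flag samples
# that deviate sharply from a rolling baseline. Data is simulated.
from collections import deque
from statistics import mean, stdev
import random

window = deque(maxlen=60)            # rolling window of recent samples
for step in range(300):
    sample = random.gauss(80, 3)     # e.g. GPU utilization in percent
    if step == 200:
        sample = 5.0                 # injected failure for the demo
    if len(window) >= 30 and abs(sample - mean(window)) > 4 * stdev(window):
        print(f"step {step}: anomalous reading {sample:.1f}%")
    window.append(sample)
```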

(Understand the layers: read about the OSI networking model.)

Downstream AI infrastructure

AI models are deployed in production environments either:

  • For downstream AI tasks such as edge AI and IoT computing.
  • As part of another service that integrates with your AI data platform to run AI workloads.

The infrastructure running these services is not part of the AI data and processing pipeline but is integrated via API calls to deliver a secondary downstream service.
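A minimal sketch of such an API integration, with a hypothetical endpoint URL and payload schema:

```python
# A minimal sketch of downstream integration over an API call. The
# endpoint URL and payload schema are hypothetical placeholders.
import json
from urllib import request

payload = {"query": "rank these items", "top_k": 5}
req = request.Request(
    "https://inference.internal.example/v1/rank",  # hypothetical endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)   # uncomment against a real service
```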

For example, Meta uses its Llama 3 GPU clusters primarily for generative AI use cases. And as it expands its GPU cluster portfolio, secondary services - such as ads, search, recommender systems, and ranking algorithms - can take advantage of its genAI models.

All of this requires an expansive data lake platform that can:

  • Ingest data in real time.
  • Process it using advanced AI models.
  • Respond to user queries efficiently as an integrated downstream service.

(Learn how Splunk AI accelerates detection, investigation and response.)

AI Infrastructure in the real world

Now, let's look at a specific example of an AI infrastructure.

Meta recently published details of the generative AI infrastructure it uses to train and run its latest LLMs, including Llama 3. The infrastructure includes two GPU clusters, each containing 24,576 flagship NVIDIA H100 GPUs. This is an upgrade from the company's previous-generation AI infrastructure, which contained 16,000 NVIDIA A100 GPUs.

The company plans to further extend its computing capacity to 350,000 H100 GPUs by the end of 2024.

These clusters run on two different network fabric systems:

  • One network system is designed with RDMA over Converged Ethernet (RoCE).
  • The other is based on the NVIDIA Quantum-2 InfiniBand network fabric.

Both fabrics offer 400 Gbps endpoint speeds. Meta runs the clusters on its own hardware platform, Grand Teton, which it open-sourced through the Open Compute Project (OCP). The platform is built on the Open Rack v3 (ORV3) design, which has been widely adopted as an industry standard, and the ORV3 ecosystem includes cooling capabilities optimized for AI GPU clusters.

Storage is based on Meta's Tectonic filesystem, which consolidates multitenant filesystem instances for exabyte-scale distributed data workloads. Other storage deployments include high-capacity E1.S SSD systems built on the YV3 Sierra Point server platform.
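A back-of-the-envelope sketch of what clusters at this scale imply for the resource question raised below. The 700 W figure is the published TDP of an H100 SXM module; the overhead multiplier for cooling and networking is an assumption:

```python
# A back-of-the-envelope sketch of the power draw implied by the
# cluster sizes above. 700 W is the published TDP of an H100 SXM
# module; the overhead multiplier is an assumption.
gpus_per_cluster = 24_576
tdp_watts = 700
overhead = 1.5   # assumed facility overhead (PUE-like factor)

megawatts = gpus_per_cluster * tdp_watts * overhead / 1e6
print(f"~{megawatts:.0f} MW per cluster at full load")  # ~26 MW
```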

AI infrastructure requires significant resources

AI is certainly on its way to changing much about how we work and use the internet. However, it's important to understand the resources - power, money, limited natural resources - that go into running any AI system.