CIOs expect extensive value from their artificial intelligence (AI) investments, including increased productivity, enhanced customer experience (CX) and digital transformation. As a result, Gartner client interest in deploying AI infrastructure – including graphics processing units (GPUs) and AI servers – has grown substantially.
Specifically, client enquiries regarding GPUs and AI infrastructure increased nearly fourfold annually from October 2022 through October 2024. Clients are exploring hosted, cloud and on-premise options for GPU deployment. In some cases, enterprises will select a “full-stack” AI offering that bundles GPU, compute, storage and networking in a single package. In other instances, enterprises will select, integrate and deploy the pieces individually. The requirements of AI workloads differ from those of most existing datacentre workloads.
Multiple interconnect technologies are available to support GPU connectivity. A common question from Gartner clients is: “Should I use Ethernet, InfiniBand or NVLink to connect GPU clusters?” All three approaches can be valid, depending on the scenario.
These technologies are not mutually exclusive. Enterprises can deploy them in conjunction with one another (for example, InfiniBand or Ethernet to scale out beyond a rack). A common misconception is that only InfiniBand or a supplier-proprietary interconnect technology (such as NVLink) can deliver appropriate performance and reliability.
However, Gartner recommends that enterprises deploy Ethernet over alternative technologies, such as InfiniBand, for clusters of up to several thousand GPUs. Ethernet-based infrastructure can provide the necessary reliability and performance, and there is widespread enterprise experience with the technology. Furthermore, a broad ecosystem of suppliers is associated with Ethernet technology.
Optimise Network Deployments for GPU Traffic
The current state of practice for central processing unit (CPU)-based, general-purpose computing workloads is a leaf/spine network topology.
However, leaf-spine topologies are not always optimal for AI workloads. In addition, running AI workloads colocated on existing datacentre networks can create noisy-neighbour effects that degrade performance for both AI and existing workloads. This can delay processing and increase job completion time for AI workloads, which is highly undesirable.
In a build-out of AI infrastructure, networking switches typically represent 15% or less of the cost. As a result, saving money by reusing existing switches often leads to suboptimal overall price/performance for the AI workload investment. Gartner therefore makes several recommendations.
Due to the unique traffic requirements and GPU costs, Gartner suggests building out dedicated physical switches for GPU connectivity. Furthermore, rather than defaulting to a leaf-spine topology, Gartner suggests using a minimal number of physical switches to reduce physical “hops”. This may still result in a leaf-spine topology, but it can also lead to other topologies, including single-switch, two-switch, full-mesh, cube-mesh and dragonfly.
Avoid using the same switches for other generalised datacentre computing needs. For clusters below 500 GPUs, one or two physical switches are ideal. For organisations with more than 500 GPUs, Gartner advises IT decision-makers to build out a dedicated AI Ethernet fabric. This is likely to require a deviation from standard, state-of-practice top-of-rack topologies towards middle-of-row and/or modular switching implementations.
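As a rough illustration of the hop-count reasoning above, the following sketch compares worst-case switch traversals and cluster capacity for a single-switch versus a two-tier leaf-spine design. The 64-port radix, the eight-leaf count and the non-blocking port split are hypothetical examples for illustration, not Gartner figures.

```python
# Illustrative arithmetic only: worst-case switch hops and cluster size
# for two topologies. Port counts are hypothetical, not vendor data.

def single_switch_capacity(radix: int) -> int:
    """One switch: every port is a GPU-facing access port."""
    return radix

def leaf_spine_capacity(radix: int, leaves: int) -> int:
    """Two-tier leaf-spine: assume half of each leaf's ports face GPUs
    and half face spines (a common non-blocking split)."""
    return leaves * (radix // 2)

RADIX = 64  # hypothetical 64-port, 400Gbps-capable switch

print(f"Single switch: up to {single_switch_capacity(RADIX)} GPUs, "
      "worst-case path crosses 1 switch")
print(f"Leaf-spine (8 leaves): up to {leaf_spine_capacity(RADIX, 8)} GPUs, "
      "worst-case path crosses 3 switches (leaf -> spine -> leaf)")
```

The trade-off the sketch surfaces is the one in the text: a single switch minimises hops but caps the cluster at the switch radix, which is why larger GPU counts push designs towards middle-of-row or modular switching.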
Enhance Ethernet Builds
Gartner recommends using dedicated switches for GPU connectivity. When deploying Ethernet (as opposed to InfiniBand or shelf/rack/row-optimised interconnects), use switches that meet specific requirements. Switches need to support:
- High-speed interfaces for GPUs, including 400Gbps access ports and above.
- Lossless Ethernet, including advanced congestion-handling mechanisms – for example, datacentre quantised congestion notification (DCQCN; see the sketch after this list).
- Advanced traffic-balancing capabilities, including congestion-aware load balancing.
- Remote direct memory access (RDMA)-aware load balancing and packet spraying.
- Static pinning of flows.
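To make the congestion-handling requirement more concrete, here is a minimal, simplified sketch of DCQCN's sender-side (reaction point) rate control, assuming the switch marks packets with explicit congestion notification (ECN) and the receiver returns congestion notification packets (CNPs). The constants and the single collapsed recovery step are illustrative simplifications of the published algorithm, not a vendor implementation.

```python
# Simplified DCQCN reaction-point sketch: multiplicative rate cut on CNPs,
# gradual recovery when congestion clears. Constants are illustrative.

LINE_RATE_GBPS = 400.0
G = 1 / 256          # gain used to update the alpha congestion estimate

class DcqcnSender:
    def __init__(self):
        self.rate = LINE_RATE_GBPS      # current sending rate
        self.target = LINE_RATE_GBPS    # rate to recover towards
        self.alpha = 1.0                # estimate of congestion severity

    def on_cnp(self):
        """Receiver saw ECN-marked packets: cut rate multiplicatively."""
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - G) * self.alpha + G

    def on_quiet_period(self):
        """No CNPs for a timer interval: decay alpha, recover towards target.
        (Real DCQCN has fast-recovery/additive/hyper-increase phases;
        this collapses them into one averaging step.)"""
        self.alpha *= (1 - G)
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender()
s.on_cnp()                      # congestion signalled: rate drops sharply
print(f"after CNP: {s.rate:.0f} Gbps")
for _ in range(5):
    s.on_quiet_period()         # congestion clears: rate climbs back
print(f"after recovery: {s.rate:.0f} Gbps")
```

The point of the mechanism is that senders throttle before switch buffers overflow, which is what keeps the fabric lossless for RDMA traffic without relying on packet drops as a congestion signal.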
Furthermore, the software used to manage AI networking fabrics must be enhanced as well. This requires functionality at the management layer to alert on, diagnose and remedy issues quickly. In particular, management software that provides advanced, granular telemetry is needed: the ability to monitor and alert in real time, and to provide historical reporting, for bandwidth utilisation, packet loss, jitter, latency and availability at the sub-second level is required.
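As an illustration of what sub-second monitoring implies in practice, the following sketch polls per-port counters every 500ms, retains samples for historical reporting and flags threshold breaches. The collect() stub, port names and threshold values are all assumptions for the sake of the example; a real deployment would pull counters through the management platform's own telemetry interface (for example, streaming telemetry such as gNMI).

```python
# Minimal sketch of sub-second fabric telemetry polling with threshold
# alerting. The collect() stub and thresholds are hypothetical.

import time

THRESHOLDS = {          # illustrative alert thresholds, not Gartner guidance
    "packet_loss_pct": 0.01,
    "latency_us": 50.0,
    "utilisation_pct": 90.0,
}

def collect(port: str) -> dict:
    """Stub: replace with a real telemetry read for the given switch port."""
    return {"packet_loss_pct": 0.0, "latency_us": 12.0, "utilisation_pct": 41.0}

def poll(ports, interval_s=0.5):
    """Poll every 500ms (sub-second) and flag threshold breaches."""
    history = []                          # retained for historical reporting
    while True:
        for port in ports:
            sample = collect(port)
            history.append((time.time(), port, sample))
            for metric, limit in THRESHOLDS.items():
                if sample[metric] > limit:
                    print(f"ALERT {port}: {metric}={sample[metric]} > {limit}")
        time.sleep(interval_s)

# poll(["Ethernet1/1", "Ethernet1/2"])   # runs indefinitely
```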
Ultra Ethernet (and Accelerator) Support
When building fabrics, Gartner advises IT leaders to consider hardware providers that pledge to support the Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UAL) specifications.
The UEC is developing an industry standard to support high-performance workloads on Ethernet. As of February 2025, there is no proposed standard available, but Gartner expects a proposal before the end of 2025. The need for a standard stems from the fact that suppliers currently rely on proprietary mechanisms to provide the high-performance Ethernet necessary for AI connectivity.
Long term, this reduces interoperability for customers, as it locks them into a single supplier's implementation. The benefit of suppliers conforming to a consistent UEC standard is the ability to interoperate.
There is also a separate, but related, standards effort for a shelf/rack/row-optimised accelerator link, called UAL. The goal of UAL is to standardise a high-speed, scale-up accelerator interconnect technology aimed at addressing scale-up network bandwidth needs beyond what Ethernet and InfiniBand are currently capable of.
Reduce Risk with Co-Certified Implementations
Finally, because of the stringent performance requirements of AI workloads, connectivity between GPUs and network switches needs to be optimised and error-free from a hardware and software perspective. This can be increasingly challenging, given the rapid pace of change associated with both networking and GPU technology.
To mitigate the potential for implementation challenges, Gartner recommends following validated implementation guides that are co-certified (see box: Benefits of co-certification of networking and GPUs) by the networking and GPU suppliers. The value of following a co-certified design is that both suppliers should stand by deployments done according to the specification, ultimately reducing the likelihood of issues and decreasing mean time to repair (MTTR) in the event of an issue.
This article is based on an excerpt of the Gartner report, Key Networking Practices to Support AI Workloads in the Data Center. Andrew Lerner is a distinguished vice-president analyst at Gartner.