Generative AI (GenAI) has delivered an astounding leap forward in the capabilities and value of AI models. But these new abilities come at a price: most AI models are so massive that training and inferencing must be distributed across multiple compute resources and accelerators (xPUs) and run in parallel. The rapid growth of AI is straining hyperscaler data centers as high traffic volumes and intensive processing requirements push the limits of current networks.
The accelerating evolution of data center architecture
To successfully support AI’s rapid evolution, data center architectures and the high-speed networks they rely on must be reevaluated. An AI model’s complexity and size dictate the level of compute and memory, as well as the type and scale of network, needed to connect the AI accelerators used for training and inferencing.
Driven by AI workloads, data center requirements are growing at astounding rates (a rough sizing sketch after this list shows what these figures imply at cluster scale):
AI models are growing in complexity by 1,000 times every three years.
New models have billions, and soon trillions, of dense parameters.
Apps will require thousands of GPU accelerators.
Cluster size is quadrupling every two years.
Network bandwidth needed per accelerator is growing to more than 1 Tbps.
Traffic is growing by a factor of 10 every two years.
The number of hyperscale data centers is expected to increase from 700 in 2022 to 1,000 by 2025.
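To put these figures in perspective, here is a minimal back-of-the-envelope sketch in Python of what the cited rates imply at cluster scale. The cluster sizes are hypothetical assumptions chosen purely for illustration; the 1 Tbps per accelerator and the 10x-every-two-years traffic growth come from the list above.

```python
# Back-of-the-envelope sizing sketch based on the figures cited above.
# The cluster sizes below are illustrative assumptions, not published specs.

def backend_bandwidth_tbps(num_accelerators: int, bw_per_accelerator_tbps: float = 1.0) -> float:
    """Aggregate backend fabric capacity needed if every accelerator
    drives its full line rate at the same time (non-blocking)."""
    return num_accelerators * bw_per_accelerator_tbps

def projected_traffic(base_traffic: float, years: float,
                      growth_factor: float = 10.0, period_years: float = 2.0) -> float:
    """Project traffic assuming it grows by `growth_factor` every `period_years`."""
    return base_traffic * growth_factor ** (years / period_years)

if __name__ == "__main__":
    for gpus in (1_024, 4_096, 16_384):  # hypothetical cluster sizes
        print(f"{gpus:>6} accelerators -> {backend_bandwidth_tbps(gpus):,.0f} Tbps of backend fabric capacity")
    # Traffic growing 10x every two years compounds to roughly 31.6x after three years.
    print(f"Traffic multiplier after 3 years: {projected_traffic(1.0, 3):.1f}x")
```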
At the same time, AI workloads are driving an unprecedented demand for low-latency, high-bandwidth connectivity between servers, storage, and accelerators.
The scale required to support these workloads doesn’t come from simply adding racks to a data center. Handling large AI training and inference workloads requires a separate, scalable, routable backend network infrastructure to connect distributed GPU nodes. AI applications have less impact on the frontend Ethernet networks, where general-purpose servers handle data ingestion for the training process.
The requirements for this new backend network differ considerably from those of traditional data center frontend access networks. In addition to higher traffic and increased network bandwidth per accelerator, the backend network needs to support thousands of synchronized parallel jobs, as well as data- and compute-intensive workloads. The network must be scalable and provide low-latency, high-bandwidth connectivity between servers, storage, and the GPUs essential for AI training and inferencing.
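As an illustration of why backend fabric design differs from a frontend access network, the sketch below estimates leaf and spine switch counts for a simple non-blocking two-tier Clos fabric. The 64-port switch radix and GPU counts are hypothetical assumptions; production designs (rail-optimized, oversubscribed, or three-tier topologies) will differ.

```python
import math

# Minimal sizing sketch for a non-blocking two-tier (leaf-spine) backend
# fabric. The port radix and GPU counts are illustrative assumptions only.

def size_leaf_spine(num_gpus: int, switch_ports: int = 64) -> dict:
    """Estimate leaf and spine switch counts for a non-blocking fabric.

    Each leaf dedicates half of its ports to GPU downlinks and half to
    spine uplinks, keeping downlink and uplink capacity balanced.
    """
    gpus_per_leaf = switch_ports // 2
    leaves = math.ceil(num_gpus / gpus_per_leaf)
    # Each spine terminates one uplink from every leaf, so the spine count
    # equals the uplinks per leaf -- valid only while the leaf count fits
    # within a spine's port radix; beyond that a third tier is needed.
    spines = gpus_per_leaf if leaves <= switch_ports else None
    return {"leaves": leaves, "spines": spines}

print(size_leaf_spine(1_024))   # {'leaves': 32, 'spines': 32}
print(size_leaf_spine(2_048))   # {'leaves': 64, 'spines': 32}
print(size_leaf_spine(4_096))   # {'leaves': 128, 'spines': None} -> needs a third tier
```

The point of the sketch is that GPU count, switch radix, and the non-blocking requirement together determine fabric depth and switch count, which is why backend scale cannot be achieved by simply adding racks to an existing frontend network.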
The AI data center journey is just beginning and promises to be transformative, changing dramatically as AI evolves. Data center architectures should be evaluated for future-proofing sooner rather than later, as the strategies required for success will continue to emerge.
And while the data center is the foundational building block for AI data management, other sectors, including telecommunications providers and enterprises, are looking to develop targeted AI-powered use cases focused on achieving substantial operational efficiencies and new business outcomes. To prioritize where and how to start incorporating AI, these organizations need to determine the cost-benefit of each use case. Best practice dictates addressing data architecture and automation frameworks first to reap early benefits and set the stage for successful longer-term AI-delivered outcomes.
To learn more about AI data center networking challenges, AI’s wider applications and potential, and how effective testing helps mitigate these challenges and foster strategic success, read our eBook: Bracing for Impact: How AI Will Transform Digital Industries.