Generative AI (GenAI) has delivered an astounding leap forward in the capabilities and value of AI models. But these new abilities come at a price: most AI models are so massive that training and inferencing must be distributed across multiple compute resources and accelerators (xPUs) and run in parallel. The rapid growth of AI is straining hyperscaler data centers as high traffic volumes and intensive processing requirements push the limits of current networks.
The accelerating evolution of data center architecture
To successfully support AI’s rapid evolution, data center architectures and the high-speed networks they rely on must be reevaluated. An AI model’s complexity and size dictate the level of compute and memory, as well as the type and scale of network, needed to connect the AI accelerators used for training and inferencing.
Driven by AI workloads, data center requirements are growing at astounding rates (a rough sizing sketch after this list shows what these figures imply at cluster scale):
AI models are growing in complexity by 1,000 times every three years.
New models have billions, and soon trillions, of dense parameters.
Apps will require thousands of GPU accelerators.
Cluster size is quadrupling every two years.
Network bandwidth needed per accelerator is growing to more than 1 Tbps.
Traffic is growing by a factor of 10 every two years.
The number of hyperscale data centers is expected to increase from 700 in 2022 to 1,000 by 2025.
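To put these figures in perspective, here is a minimal back-of-the-envelope sketch in Python of what the cited rates imply at cluster scale. The cluster sizes are hypothetical assumptions chosen purely for illustration; the 1 Tbps per accelerator and the 10x-every-two-years traffic growth come from the list above.

```python
# Back-of-the-envelope sizing sketch based on the figures cited above.
# The cluster sizes below are illustrative assumptions, not published specs.

def backend_bandwidth_tbps(num_accelerators: int, bw_per_accelerator_tbps: float = 1.0) -> float:
    """Aggregate backend fabric capacity needed if every accelerator
    drives its full line rate at the same time (non-blocking)."""
    return num_accelerators * bw_per_accelerator_tbps

def projected_traffic(base_traffic: float, years: float,
                      growth_factor: float = 10.0, period_years: float = 2.0) -> float:
    """Project traffic assuming it grows by `growth_factor` every `period_years`."""
    return base_traffic * growth_factor ** (years / period_years)

if __name__ == "__main__":
    for gpus in (1_024, 4_096, 16_384):  # hypothetical cluster sizes
        print(f"{gpus:>6} accelerators -> {backend_bandwidth_tbps(gpus):,.0f} Tbps of backend fabric capacity")
    # Traffic growing 10x every two years compounds to roughly 31.6x after three years.
    print(f"Traffic multiplier after 3 years: {projected_traffic(1.0, 3):.1f}x")
```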
At the same time, AI workloads are driving an unprecedented demand for low-latency, high-bandwidth connectivity between servers, storage, and accelerators.
The scale required to support these workloads doesn’t come from simply adding racks to a data center. Handling large AI training and inference workloads requires a separate, scalable, routable backend network infrastructure to connect distributed GPU nodes. AI applications have less impact on the frontend Ethernet networks, where general-purpose servers handle data ingestion for the training process.
The requirements for this new backend network differ considerably from those of traditional data center frontend access networks. In addition to higher traffic and increased network bandwidth per accelerator, the backend network needs to support thousands of synchronized parallel jobs, as well as data- and compute-intensive workloads. The network must be scalable and provide low-latency, high-bandwidth connectivity between servers, storage, and the GPUs essential for AI training and inferencing.
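As an illustration of why backend fabric design differs from a frontend access network, the sketch below estimates leaf and spine switch counts for a simple non-blocking two-tier Clos fabric. The 64-port switch radix and GPU counts are hypothetical assumptions; production designs (rail-optimized, oversubscribed, or three-tier topologies) will differ.

```python
import math

# Minimal sizing sketch for a non-blocking two-tier (leaf-spine) backend
# fabric. The port radix and GPU counts are illustrative assumptions only.

def size_leaf_spine(num_gpus: int, switch_ports: int = 64) -> dict:
    """Estimate leaf and spine switch counts for a non-blocking fabric.

    Each leaf dedicates half of its ports to GPU downlinks and half to
    spine uplinks, keeping downlink and uplink capacity balanced.
    """
    gpus_per_leaf = switch_ports // 2
    leaves = math.ceil(num_gpus / gpus_per_leaf)
    # Each spine terminates one uplink from every leaf, so the spine count
    # equals the uplinks per leaf -- valid only while the leaf count fits
    # within a spine's port radix; beyond that a third tier is needed.
    spines = gpus_per_leaf if leaves <= switch_ports else None
    return {"leaves": leaves, "spines": spines}

print(size_leaf_spine(1_024))   # {'leaves': 32, 'spines': 32}
print(size_leaf_spine(2_048))   # {'leaves': 64, 'spines': 32}
print(size_leaf_spine(4_096))   # {'leaves': 128, 'spines': None} -> needs a third tier
```

The point of the sketch is that GPU count, switch radix, and the non-blocking requirement together determine fabric depth and switch count, which is why backend scale cannot be achieved by simply adding racks to an existing frontend network.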
The AI data center journey is just beginning and promises to be transformative, changing dramatically as AI evolves. Data center architectures should be evaluated for future-proofing sooner rather than later, as the strategies required for success will continue to emerge.
And while the data center is the foundational building block for AI data management, other sectors, including telecommunications providers and enterprises, are looking to develop targeted AI-powered use cases focused on achieving substantial operational efficiencies and new business outcomes. To prioritize where and how to start incorporating AI, these organizations need to determine the cost-benefit of each use case. Best practice dictates addressing data architecture and automation frameworks first to reap early benefits and set the stage for successful longer-term AI-delivered outcomes.
To learn more about AI data center networking challenges, AI’s wider applications and potential, and how effective testing helps mitigate these challenges and foster strategic success, read our eBook: Bracing for Impact: How AI Will Transform Digital Industries.