Cisco’s new router unites disparate datacenters into AI training behemoths

Cisco has unveiled a new routing ASIC designed to help bit barn operators overcome power and capacity constraints by stitching together their existing datacenters into a single unified compute cluster.


The Cisco 8223, announced on Wednesday, is a 51.2 Tbps router powered by Cisco's in-house Silicon One P200 ASIC. Combined with 800 Gbps coherent optics, Cisco says the platform can support spans of up to 1,000 kilometers. Connect up enough routers, and the architecture can theoretically achieve an aggregate bandwidth of three exabits per second, Cisco says. That's more than enough to connect even the largest AI training clusters today.


In fact, such a network would be able to support multi-site deployments containing several million GPUs, though as you might expect, achieving that level of bandwidth won't be cheap, requiring thousands of routers to make it all work.
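As a rough sanity check (our arithmetic, not Cisco's), dividing the claimed three-exabit aggregate by the 51.2 Tbps each 8223 provides gives a feel for the scale involved. The sketch below is a naive division that ignores topology, oversubscription, and how ports are split between fabric-facing and client-facing links:

```python
# Back-of-envelope: how many 51.2 Tbps routers a 3 Eb/s aggregate implies.
# Naive division only -- ignores tiering, oversubscription, and the split
# between fabric-facing and client-facing ports.
AGGREGATE_BPS = 3e18        # 3 exabits per second (Cisco's theoretical figure)
ROUTER_BPS = 51.2e12        # one Cisco 8223 (Silicon One P200)

routers = AGGREGATE_BPS / ROUTER_BPS
print(f"~{routers:,.0f} routers' worth of raw switching capacity")  # ~58,594
```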


For customers who don't need a connection quite that fast, Cisco says the routers can support up to 13 Pbps of bandwidth using a smaller two-tiered network.


The idea of a high-speed, scale-across datacenter network has already caught the attention of several large cloud providers, including Microsoft and Alibaba, which Cisco tells us are evaluating the chips for potential deployment.


"This new routing chip will enable us to extend into the Core network, replacing traditional chassis-based routers with a cluster of P200-powered devices. This transition will significantly enhance the stability, reliability, and scalability of our DCI network," Dennis Cai, Alibaba Cloud's head of network infrastructure, said in a canned statement.


Cisco is only the latest networking vendor to jump on the distributed datacenter bandwagon. Earlier this year, Nvidia and Broadcom announced their own scale-across networking ASICs.

Much like the P200, Broadcom's Jericho4 is a 51.2 Tbps switch primarily designed for use in high-speed datacenter-to-datacenter fabrics. Broadcom says the chip can bridge datacenters up to 100 kilometers apart at speeds exceeding 100 Pbps.


Nvidia has also gotten in on the fun, teasing its Spectrum-XGS switches at Hot Chips earlier this summer. While details on the actual hardware are still thin, GPU bit barn operator CoreWeave has already committed to using the tech to connect its datacenters into a "single unified supercomputer."


While these switch and routing ASICs may help datacenter operators overcome power and capacity constraints, latency remains an ongoing challenge.


We often think of the speed of light as being instantaneous, but it's not actually that fast, and light in optical fiber travels at only about two-thirds of its vacuum speed. A packet sent between two datacenters located 1,000 kilometers apart would therefore take roughly five milliseconds, one way, to reach its destination, and that's before you take into consideration the additional latency incurred by the transceivers, amplifiers, and repeaters required to get the signal there in one piece.
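A minimal sketch of that arithmetic, assuming light in fiber travels at roughly two-thirds of its vacuum speed and ignoring all equipment and queuing delay:

```python
# One-way propagation delay over optical fiber, ignoring transceivers,
# amplifiers, repeaters, and queuing. Assumes a refractive index of ~1.47,
# i.e. light moves at about two-thirds of its vacuum speed in the glass.
C_VACUUM_KM_S = 299_792     # speed of light in vacuum, km/s
FIBER_INDEX = 1.47          # typical for single-mode fiber

def one_way_delay_ms(distance_km: float) -> float:
    speed_in_fiber_km_s = C_VACUUM_KM_S / FIBER_INDEX   # ~204,000 km/s
    return distance_km / speed_in_fiber_km_s * 1000.0   # seconds -> ms

print(f"{one_way_delay_ms(1000):.1f} ms one way over 1,000 km")  # ~4.9 ms
```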


With that said, research from Google's DeepMind team, published earlier this year, shows that many of these challenges can be overcome by compressing the models during training and strategically scheduling communications between the participating datacenters.
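DeepMind's actual techniques are more involved, but the basic idea of shrinking what has to cross the long-haul link can be illustrated with a toy sketch: quantize a gradient (or parameter delta) down to 8 bits before it leaves one site, then reconstruct an approximation on the far side. The function names and the 8-bit choice below are illustrative assumptions, not DeepMind's published method:

```python
import numpy as np

# Toy cross-site compression: squash a float32 gradient to int8 before it
# crosses the inter-datacenter link, then rebuild an approximation on arrival.
# Purely illustrative -- not the scheme described in DeepMind's research.
def compress(grad: np.ndarray) -> tuple[np.ndarray, float]:
    scale = max(float(np.abs(grad).max()), 1e-12) / 127.0
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale                      # 4x fewer bytes on the wire

def decompress(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

grad = np.random.randn(1_000_000).astype(np.float32)
q, scale = compress(grad)
print(f"{grad.nbytes / q.nbytes:.0f}x fewer bytes sent between sites")
```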


Source: The Register
