Artificial intelligence (AI) and machine learning (ML) are more than algorithms – the right hardware to power your AI and ML calculations is key.
To speed job completion, AI and ML training clusters need high bandwidth and reliable transport with predictably low tail latency (tail latency refers to the slowest responses, typically the last 1-2% of a job's results, which finish well after the rest). A high-performance interconnect can optimize high-performance computing (HPC) and data center workloads across a portfolio of hyper-converged AI/ML training clusters, resulting in lower latency for better model training, higher network utilization and lower operating costs.
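To illustrate why tail latency matters so much here, consider a minimal sketch (illustrative only, not Broadcom code): even when the vast majority of responses are fast, the slowest 1-2% can dominate how long a job takes to finish.

```python
def percentile(values, pct):
    """Return the value at the given percentile (nearest-rank method)."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# 98 fast responses plus 2 stragglers: the slow 2% set the tail.
latencies_ms = [10] * 98 + [500, 900]
median = percentile(latencies_ms, 50)  # 10 ms
p99 = percentile(latencies_ms, 99)     # 500 ms, 50x the median
```

Because a training step typically cannot complete until its slowest transfer arrives, shaving the tail (not the median) is what shortens job completion time.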
As AI/ML training jobs become more prevalent, higher-radix switches, which lower latency and power consumption, and higher port speeds are critical to building larger training clusters with flat network topologies.
Ethernet switching to optimize performance
While network bandwidth requirements in data centers continue to increase dramatically, there is also a strong push to consolidate compute and storage infrastructure alongside processors optimized for AI/ML training. As a result, AI/ML training clusters, in which multiple machines are dedicated to training, are driving demand for fabrics that offer high-bandwidth connectivity, high radix and faster job completion while operating at high network utilization.
Effective load balancing to achieve high network utilization, along with congestion control mechanisms to achieve predictable tail latency, is critical to speeding job completion. Efficient virtualized data infrastructure, combined with capable hardware, can also improve CPU offload and help network accelerators speed up neural network training.
Ethernet-based infrastructures currently offer the best solution for a unified network. They combine low power with high bandwidth and radix, and the fastest serialization/deserialization (SerDes) speeds, with a predictable doubling of bandwidth every 18 to 24 months. With these advantages, in addition to its large ecosystem, Ethernet can provide the highest performance interconnection per watt per dollar for AI/ML and cloud-scale infrastructure.
According to IDC, the global Ethernet switch market grew 12.7% year over year to $7.6 billion in the first quarter of 2022 (1Q22). Broadcom offers the Tomahawk family of Ethernet switches to enable the next generation of unified networks.
Today, San Jose-based Broadcom announced the StrataXGS Tomahawk 5 series of switches, offering 51.2 terabits per second (Tbps) of Ethernet switching capacity in a single monolithic device, which the company says is more than twice the bandwidth of its contemporaries.
“Tomahawk 5 has twice the capacity of Tomahawk 4. As a result, it is one of the fastest switching chips in the world,” Ram Velaga, senior vice president and general manager of Broadcom’s core switching group, told VentureBeat. “Features and capabilities added specifically to optimize the performance of AI and ML networks make Tomahawk 5 twice as fast as the previous version.”
Tomahawk 5 switch chips are designed to help data centers and HPC environments accelerate AI and machine learning workloads. The switch chip uses a Broadcom approach known as cognitive routing, along with an advanced shared packet buffer, programmable in-band telemetry and hardware-based link failover built into the chip.
Cognitive routing optimizes network link utilization by automatically and dynamically selecting the least loaded links in the system for each flow that passes through the switch. This is especially important for AI and ML workloads, which often combine short and long duration high bandwidth streams with low entropy.
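The core idea of per-flow, load-aware link selection can be sketched in a few lines (a hypothetical illustration of the concept as described here, not the chip's actual algorithm; the names are invented):

```python
def pick_link(link_loads: dict) -> str:
    """Select the least-loaded link for a new flow."""
    return min(link_loads, key=link_loads.get)

# Current utilization of three parallel links (illustrative values).
loads = {"link_a": 0.7, "link_b": 0.2, "link_c": 0.5}
chosen = pick_link(loads)  # "link_b"
loads[chosen] += 0.3       # account for the new flow's expected load
```

A real switch makes this decision in hardware at line rate, but the principle is the same: each new flow is steered to whichever parallel path currently has the most headroom, rather than being pinned to a path by a static hash.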
“Cognitive routing is a step beyond adaptive routing,” Velaga said. “When you use adaptive routing, you only know about the data congestion between two points, but you don’t know the other endpoints.” Cognitive routing, he added, makes the system aware of conditions beyond its next-hop neighbors, rerouting traffic along an optimal path that provides better load balancing and avoids congestion.
Tomahawk 5 includes real-time dynamic load balancing, which monitors the usage of all links in the switch and downstream in the network to determine the best path for each flow. It also monitors the state of hardware links and automatically redirects traffic away from failed connections. These features improve network utilization and reduce congestion, resulting in a shorter job completion time (JCT).
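Combining the two behaviors described above, health-based failover and load-aware path choice, can be sketched as follows (again a hypothetical illustration with invented names, not vendor code):

```python
def route_flow(links: dict) -> str:
    """links: {name: {"up": bool, "load": float}}.
    Return the least-loaded healthy link, skipping failed ones."""
    healthy = {name: s["load"] for name, s in links.items() if s["up"]}
    if not healthy:
        raise RuntimeError("no healthy links available")
    return min(healthy, key=healthy.get)

fabric = {
    "link_a": {"up": True,  "load": 0.1},
    "link_b": {"up": False, "load": 0.0},  # failed: excluded despite zero load
    "link_c": {"up": True,  "load": 0.6},
}
best = route_flow(fabric)  # "link_a"
```

The point of doing this in switch hardware rather than in software is speed: traffic moves off a failed or congested link in the data path itself, without waiting for a routing protocol to reconverge.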
The future of Ethernet for AI and ML infrastructures
Ethernet has the features required for high-performance ML and AI training clusters: high bandwidth, end-to-end congestion management, load balancing and fabric management, at a lower cost than alternatives such as InfiniBand.
It is clear that Ethernet is a robust ecosystem that is constantly developing at a rapid rate of innovation. Broadcom has shown that it will continue to improve its Ethernet switches to keep up with the pace of innovation happening in the AI and ML industry, and will continue to be a part of HPC infrastructure well into the future.