
AI Fabric Orchestration: Supercharging AI Networks with SONiC NOS

Explore the latest in AI network management with our ONES 3.0 series

As the demand for high-performance parallel processing surges in the AI era, GPU clusters have become the heart of data-intensive workloads. But it’s not just about the GPUs themselves; intercommunication between GPU servers is the backbone of their overall performance. Enter the network switch fabric, which is pivotal in overcoming communication bottlenecks and ensuring seamless data flow between GPU servers. Technologies like RoCE (RDMA over Converged Ethernet) allow massive chunks of data to move efficiently between servers, but keeping these critical data streams lossless and uncongested requires a powerful solution.

That’s where SONiC’s QoS (Quality of Service) features come into play. SONiC lets you prioritize critical traffic, ensuring that high-priority packets are forwarded ahead of other traffic and that important data is not dropped. Using SONiC’s robust QoS capabilities together with ONES 3.0’s orchestration, you can turn your switch fabric into a lossless, priority-driven highway for GPU server communications.

Let’s explore how you can achieve this with SONiC through the ONES 3.0 Fabric Manager orchestration tool.

Lossless And Prioritized Data Flow

Any packet entering the fabric with a DSCP/DOT1P marking can be mapped to a queue on the interface, and enabling PFC (Priority Flow Control) on that queue makes it lossless. With PFC in place, when congestion is detected in the queue, a pause frame is sent back to the sender, signaling it to temporarily halt traffic of that priority. This mechanism effectively prevents packet drops, ensuring lossless transmission for traffic of that priority.
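For reference, enabling PFC on specific queues of an interface lands in SONiC’s PORT_QOS_MAP table. The sketch below renders that CONFIG_DB content as YAML for readability (SONiC stores it as JSON in config_db.json); the interface name and priorities are example values.

```yaml
# Minimal sketch: PFC enabled for priorities 3 and 4 on one port.
# CONFIG_DB content rendered as YAML; example values only.
PORT_QOS_MAP:
  Ethernet0:
    pfc_enable: "3,4"   # pause frames are generated/honored for these priorities
```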

Beyond PFC, there’s another layer of congestion management: Explicit Congestion Notification (ECN). With ECN, we can define buffer thresholds; once a queue exceeds them, Congestion Notification Packets (CNPs) are sent back to the sender, prompting it to reduce its transmission rate and proactively avoid congestion.
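In SONiC, these thresholds live in a WRED/ECN profile. A minimal sketch of such a profile, again rendered as YAML (the profile name and thresholds are example values):

```yaml
# Minimal sketch: an ECN-enabled WRED profile with example thresholds.
WRED_PROFILE:
  ECN_LOSSLESS:                      # example profile name
    ecn: "ecn_all"
    wred_green_enable: "true"
    green_min_threshold: "1000000"   # bytes; ECN marking starts above this
    green_max_threshold: "2000000"   # bytes; marking probability peaks here
    green_drop_probability: "5"      # marking probability (%) at the max threshold
```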

At this stage, we’ve ensured that our priority traffic is lossless. Moving into the egress phase, we can further enhance performance by prioritizing this traffic over others, even under congestion. SONiC provides scheduling algorithms like Deficit Weighted Round Robin (DWRR), Weighted Round Robin (WRR), and Strict Priority Scheduling (STRICT). By binding priority queues to these schedulers, the system can ensure that higher-priority traffic is transmitted preferentially, either in a weighted manner (for WRR/DWRR) or with absolute priority (for STRICT).
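As a rough illustration, SONiC models these schedulers in its SCHEDULER table and binds them to queues through the QUEUE table. The sketch below (CONFIG_DB rendered as YAML; names and values are examples) shows a weighted DWRR scheduler alongside a STRICT one:

```yaml
# Minimal sketch: one weighted (DWRR) scheduler and one strict-priority scheduler,
# attached to two queues of a port. Example names and values only.
SCHEDULER:
  scheduler.dwrr60:
    type: "DWRR"
    weight: "60"
  scheduler.strict:
    type: "STRICT"
QUEUE:
  "Ethernet0|3":
    scheduler: "scheduler.dwrr60"   # weighted service relative to other queues
  "Ethernet0|7":
    scheduler: "scheduler.strict"   # always served first when it has packets
```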

In summary, through PFC, ECN, and advanced scheduling techniques, SONiC ensures that high-priority traffic from GPU servers is not only lossless but also prioritized during both congestion and egress phases.

Simplifying Complex QoS Configurations with ONES Orchestration

Configuring SONiC’s complex QoS features may sound daunting, but with ONES 3.0’s seamless orchestration, it’s a breeze. ONES allows you to set up essential QoS configurations like DSCP-to-traffic-class mapping, PFC, ECN thresholds, and even scheduler types, all with a few lines in a YAML template. Here’s a snapshot of the YAML template showing how ONES orchestrates SONiC QoS (QoS is its own section in the YAML):

Fig 1 – ONES UI AI Fabric Orchestration YAML Template
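For readers who cannot see the screenshot, the sketch below gives a rough idea of what such a QoS section might contain. The field names and values are illustrative assumptions, not the exact ONES 3.0 schema; Fig 1 shows the authoritative template.

```yaml
# Illustrative sketch of a QoS section; field names are assumptions, not the
# exact ONES 3.0 template schema (see Fig 1 for the real one).
qos:
  dscp_to_tc:              # DSCP (L3) marking -> traffic class
    "26": 3
  dot1p_to_tc:             # dot1p (L2) marking -> traffic class
    "3": 3
  tc_to_queue:             # traffic class -> egress queue
    "3": 3
    "4": 4
  tc_to_pg:                # traffic class -> ingress priority group
    "3": 3
    "4": 4
  pfc:
    queues: [3, 4]         # queues to make lossless
    watchdog:
      detection_time: 200      # ms
      restoration_time: 200    # ms
      action: drop
  ecn:
    min_threshold: 1000000     # bytes
    max_threshold: 2000000     # bytes
    probability: 5             # percent
    cnp_queue: 7               # strict-priority queue reserved for CNPs
  scheduler:
    queue_weights:
      "3": 60                  # DWRR weight for queue 3
      "4": 40                  # DWRR weight for queue 4
```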

The Fabric Manager automates the creation and assignment of QoS profiles, saving administrators from manually configuring multiple aspects. Here’s how it works:

Mapping Traffic Classes and Queues

Orchestration begins by mapping traffic into appropriate classes and queues. ONES 3.0 orchestration allows you to specify mapping values from DSCP (Layer 3) and dot1p (Layer 2) to traffic classes, from traffic classes to queues, and from traffic classes to priority groups (PGs). Once these mapping values are specified, profiles with standard names such as DOT1P_TC_PROFILE, TC_QUEUE_PROFILE, TC_PG_PROFILE, and DSCP_TC_PROFILE are created from them and bound to the interfaces that are part of the orchestration. This configuration ensures that each type of traffic is routed to its appropriate queue and handled correctly.

For example, when mapping values are specified in the YAML as shown in the image above, the Fabric Manager creates the corresponding profiles and binds them to the interfaces as shown below:
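As a hedged illustration of the outcome (the exact entries depend on the mapping values in the template), the resulting SONiC CONFIG_DB state, rendered as YAML, could look roughly like this:

```yaml
# Minimal sketch of the profiles ONES-FM creates and their binding to a port.
# CONFIG_DB rendered as YAML; mapping values are examples.
DSCP_TO_TC_MAP:
  DSCP_TC_PROFILE:
    "26": "3"
DOT1P_TO_TC_MAP:
  DOT1P_TC_PROFILE:
    "3": "3"
TC_TO_QUEUE_MAP:
  TC_QUEUE_PROFILE:
    "3": "3"
    "4": "4"
TC_TO_PRIORITY_GROUP_MAP:
  TC_PG_PROFILE:
    "3": "3"
    "4": "4"
PORT_QOS_MAP:
  Ethernet0:
    dscp_to_tc_map: "DSCP_TC_PROFILE"
    dot1p_to_tc_map: "DOT1P_TC_PROFILE"
    tc_to_queue_map: "TC_QUEUE_PROFILE"
    tc_to_pg_map: "TC_PG_PROFILE"
```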

Priority Flow Control (PFC) and Explicit Congestion Notification (ECN)

The next critical part of QoS orchestration involves Priority Flow Control (PFC), where the ONES YAML allows users to define the queues that should be PFC-enabled. Moreover, a PFC watchdog can be configured with detection and restoration times, along with the action to take if PFC malfunctions, to ensure that PFC itself keeps working as intended.
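On the switch, the watchdog settings ultimately correspond to SONiC’s PFC_WD table. A minimal sketch (CONFIG_DB rendered as YAML; timers and action are example values):

```yaml
# Minimal sketch: PFC watchdog on one port with example timers.
PFC_WD:
  Ethernet0:
    detection_time: "200"     # ms to detect a stuck (continuously paused) queue
    restoration_time: "200"   # ms before re-checking and restoring the queue
    action: "drop"            # what to do with traffic on a stuck queue
```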

ECN configuration parameters can also be provided in the YAML template; from these, the ONES Fabric Manager creates a profile named WRED_PROFILE and attaches it to every PFC-enabled queue on all interfaces that are part of the orchestration.

Here’s an example of how this would be configured on the interface for the YAML input in the above image.
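Since the screenshot is not reproduced here, a rough sketch of what that attachment looks like in CONFIG_DB (rendered as YAML; thresholds come from the template) is shown below:

```yaml
# Minimal sketch: the ONES-created WRED_PROFILE attached to the PFC-enabled queues.
WRED_PROFILE:
  WRED_PROFILE:
    ecn: "ecn_all"
    wred_green_enable: "true"
    green_min_threshold: "1000000"
    green_max_threshold: "2000000"
    green_drop_probability: "5"
QUEUE:
  "Ethernet0|3":
    wred_profile: "WRED_PROFILE"
  "Ethernet0|4":
    wred_profile: "WRED_PROFILE"
```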

This approach ensures that your network proactively manages congestion and minimizes packet drops for high-priority traffic.

Advanced Scheduling for Optimized Egress

Finally, scheduling plays a vital role in controlling how packets are forwarded from queues. Orchestration allows administrators to choose between scheduling mechanisms such as Deficit Weighted Round Robin (DWRR), Weighted Round Robin (WRR), or STRICT priority scheduling, depending on their needs.

In the case of DWRR or WRR, weights can be assigned to each queue, influencing how often a queue is serviced relative to others. Upon specifying these parameters in the YAML, ONES-FM creates one scheduler policy (SCHEDULER.<weight>) for each unique weight assigned to the queues and attaches these policies to the queues according to their weights, for all the interfaces that are part of the orchestration.

For instance, in the YAML input shown in the image below, there are two unique weights, 60 and 40, assigned to queues 3 and 4 respectively. So two scheduler policies, SCHEDULER.60 and SCHEDULER.40, are created and bound to interface queues 3 and 4 respectively.
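A hedged sketch of the resulting CONFIG_DB entries (rendered as YAML; the policy names follow the SCHEDULER.<weight> convention described above):

```yaml
# Minimal sketch: one scheduler policy per unique weight, bound to the matching queues.
SCHEDULER:
  SCHEDULER.60:
    type: "DWRR"
    weight: "60"
  SCHEDULER.40:
    type: "DWRR"
    weight: "40"
QUEUE:
  "Ethernet0|3":
    scheduler: "SCHEDULER.60"
  "Ethernet0|4":
    scheduler: "SCHEDULER.40"
```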

Now, here comes a question: what if all the queues are congested? How do the congestion notification packets traverse the network to reach the sender and tell it to stop or slow down the incoming traffic?

ONES-FM provides an option to designate a specific queue for ECN CNP (Congestion Notification Packet) traffic and schedule it with STRICT priority, ensuring that even when the network is heavily congested there is always room left for the congestion notification packets, preventing further blockages. The cnp_queue field under the ECN section in the image above represents this, and it is orchestrated by ONES-FM as shown below:
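As a hedged sketch (CONFIG_DB rendered as YAML; the policy name is an example and the queue number comes from cnp_queue in the template), the CNP queue simply gets a strict-priority scheduler:

```yaml
# Minimal sketch: a dedicated CNP queue served with strict priority.
SCHEDULER:
  SCHEDULER.STRICT:                 # example policy name
    type: "STRICT"
QUEUE:
  "Ethernet0|7":                    # queue number taken from cnp_queue in the template
    scheduler: "SCHEDULER.STRICT"
```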

Flexible, Day-2 Support for QoS Management

One of the standout features of ONES-FM 3.0 is its support for Day-2 operations. As your network evolves and traffic patterns change, you can modify the QoS configurations through either the YAML template or the NetOps API. This flexibility ensures your network is always tuned to deliver the performance required by your AI workloads.
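For example, a Day-2 adjustment could be as small as editing the scheduler weights in the same (illustrative) QoS section sketched earlier and re-applying the template; the field names below are the same assumptions as before, not the exact ONES schema.

```yaml
# Illustrative Day-2 tweak: rebalancing DWRR weights after traffic patterns change.
qos:
  scheduler:
    queue_weights:
      "3": 70    # previously 60
      "4": 30    # previously 40
```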

Future-Proof Your AI Infrastructure with ONES 3.0

With its intuitive YAML-based approach and support for dynamic Day-2 adjustments, ONES Fabric Manager eliminates much of the complexity associated with configuring and managing networks, giving you confidence that your network infrastructure is both reliable and future-proof. In essence, ONES Fabric Manager enables seamless orchestration for AI fabrics, ensuring your network is always ready to meet the growing demands of AI-driven data centers.