
Aviz ONES 2.0: Closing in on the Reality of SONiC-based AI Fabrics

As technology advances, several trends are emerging in the application of Generative AI for networking, paving the way for more intelligent and adaptive network infrastructures. Notable trends include Predictive Network Analytics, AI-Enhanced QoS, Network Resource Optimization, Anomaly Detection, Simulation of Realistic Network Environments, and Autonomous Network Operations. RoCE (RDMA over Converged Ethernet) can address several of the challenges that Generative AI poses to networking devices.

RoCE serves as the foundation for the AI fabric due to its improved model training speed, optimized and reliable data movement, and compatibility with Ethernet networks. Effective monitoring of RoCE traffic is therefore instrumental in maintaining seamless operations.

Another important technique, proactive congestion management, is crucial for maintaining optimal performance, reliability, and efficiency. AI workloads often involve the exchange of large datasets and real-time communication between nodes. Network congestion can degrade performance, slowing down data transfers and compromising the responsiveness of AI applications. By identifying and addressing potential congestion points before they impact performance, proactive congestion management helps keep generative AI tasks running at full speed and meeting the demands of real-time or near-real-time processing.
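
As a rough illustration of what "identifying potential congestion points before they impact performance" can look like, the sketch below smooths per-queue utilization samples with an exponentially weighted moving average and flags queues that stay above a warning level. It is a minimal sketch under stated assumptions, not how ONES implements the feature: the interface names, the 0.7 warning level, and the function names are all hypothetical.

```python
def ewma(values, alpha=0.5):
    """Exponentially weighted moving average of a non-empty list of numbers."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg


def flag_congestion_risk(history, warn_at=0.7):
    """Return the (port, queue) keys whose smoothed utilization is >= warn_at.

    history maps (port, queue) to utilization samples in the range 0.0 to 1.0,
    oldest first. The threshold is an assumed value chosen for illustration.
    """
    return sorted(
        key for key, samples in history.items()
        if samples and ewma(samples) >= warn_at
    )


# Hypothetical samples: queue 3 is filling steadily, queue 4 is idle.
history = {
    ("Ethernet0", 3): [0.40, 0.60, 0.80, 0.90],
    ("Ethernet0", 4): [0.10, 0.15, 0.10, 0.12],
}
at_risk = flag_congestion_risk(history)
print(at_risk)
```

Smoothing the samples rather than reacting to a single reading avoids flagging a queue on one short burst, at the cost of reacting slightly later to a genuine ramp.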

AI Fabric Insight: AI workload with GPU and DPU

ONES – Crafted for SONiC-based AI Fabric

In the ever-evolving realm of generative AI networks, where the need for high-performance, low-latency communication takes center stage, ONES 2.0 is set to redefine network optimization. This latest release is built to streamline network operations. ONES incorporates advanced features such as Priority Flow Control (PFC) counters for RoCE support and proactive congestion management based on per-port and per-queue utilization details. ONES collects the metrics that SONiC-based AI fabrics depend on across multiple vendor platforms, scales well, and powers the data collection process. It also integrates with the wider ONES ecosystem – orchestration, visibility, and support for third-party APIs including REST and Prometheus – offering streamlined management, comprehensive monitoring, and flexible interoperability in complex network environments.
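
Since the release mentions Prometheus support, here is a minimal sketch of how a consumer might read PFC counters from a Prometheus instant-query response. The JSON envelope (`status`, `data.resultType`, `data.result`, and string-encoded sample values) follows the standard Prometheus HTTP API; the metric name `pfc_rx_pause_frames` and the `device`, `interface`, and `priority` labels are assumptions made for illustration and are not taken from ONES documentation.

```python
import json

# Sample payload in the Prometheus instant-query response format.
# The metric name and label names below are hypothetical.
SAMPLE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "pfc_rx_pause_frames", "device": "leaf1",
                  "interface": "Ethernet51", "priority": "3"},
       "value": [1700000000.0, "1200"]},
      {"metric": {"__name__": "pfc_rx_pause_frames", "device": "leaf1",
                  "interface": "Ethernet51", "priority": "4"},
       "value": [1700000000.0, "35"]}
    ]
  }
}
"""


def parse_instant_vector(payload):
    """Map (device, interface, priority) to the sample value as a float."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("query did not succeed")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    counters = {}
    for series in data["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # value arrives as a string
        key = (labels["device"], labels["interface"], int(labels["priority"]))
        counters[key] = float(value)
    return counters


counters = parse_instant_vector(SAMPLE)
print(counters)
```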

ONES Unveiling SONiC AI Fabrics & RoCE: A Visual Exploration 

ONES collects a set of metrics that are instrumental in monitoring RoCE (RDMA over Converged Ethernet). They provide insight into the flow control mechanisms and help ensure efficient, reliable communication in RoCE-enabled networks.

How Does Metric Collection Empower AI Fabrics to Tackle Challenges?

  • Traffic Prioritization: These metrics reveal how different types of traffic are prioritized in the network. In RoCE, where low-latency communication is crucial, the ability to prioritize traffic ensures that RDMA operations and other critical data transmissions are given precedence.
  • Congestion Management: These metrics help in monitoring and managing network congestion. RoCE networks can experience congestion, and PFC pauses traffic on a per-priority basis when it occurs, preventing packet loss on lossless classes and keeping RDMA communication running smoothly.

Powering AI with PFC and Rx/Tx Watermark counters
  • Quality of Service (QoS): RoCE networks often have specific QoS requirements. These metrics provide data on how well the network adheres to these QoS policies. Monitoring allows network administrators to ensure that RoCE traffic receives the necessary level of service, minimizing latency and optimizing performance.
  • Identifying Bottlenecks: ONES can highlight potential bottlenecks in the network. By monitoring the pause frames and PFC counters, administrators can identify areas of congestion or network inefficiencies that may impact RoCE performance.
  • Real-time Monitoring: Real-time monitoring by ONES allows an immediate response to changes in network conditions. In RoCE environments, where rapid data transfers are common, timely identification and resolution of congestion issues helps maintain low latency and high throughput.
  • Performance Optimization: Understanding these metrics enables administrators to optimize the performance of RoCE networks. By analyzing the data, they can adjust network configurations, traffic prioritization, or resource allocation to enhance overall RoCE performance.
  • Capacity Planning: ONES metrics contribute to capacity planning by providing insights into how well the network can handle the current load and whether there is room for expansion. This is crucial for scaling RoCE networks to accommodate growing demands.
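
To make the bottleneck-identification point concrete, the sketch below turns two polls of cumulative PFC pause-frame counters into per-second rates and ranks the busiest (interface, priority) pairs. It is an illustrative sketch only: the counter layout, the interface names, and the 30-second polling interval are assumptions, and a counter that goes backwards is simply treated as having been reset.

```python
def pause_rates(prev, curr, interval_s):
    """Per-second growth of each pause-frame counter between two polls.

    prev and curr map (interface, priority) to cumulative counts.
    Keys missing from prev are skipped because no delta can be computed.
    A counter lower than its previous value is treated as reset to zero.
    """
    rates = {}
    for key, now in curr.items():
        if key not in prev:
            continue
        before = prev[key]
        delta = now - before if now >= before else now
        rates[key] = delta / interval_s
    return rates


def top_pause_sources(rates, n=3):
    """Return the n (key, rate) pairs with the highest pause-frame rate."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical counters polled 30 seconds apart.
prev = {("Ethernet51", 3): 1000, ("Ethernet51", 4): 500, ("Ethernet12", 3): 40}
curr = {("Ethernet51", 3): 4000, ("Ethernet51", 4): 560, ("Ethernet12", 3): 10}

rates = pause_rates(prev, curr, interval_s=30)
hot = top_pause_sources(rates, n=2)
print(hot)
```

Working with rates rather than raw totals matters because the counters are cumulative: a large absolute value may only mean the port has been up for a long time.
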
Figure 1: Topology Overview of RoCE Traffic. Nodes representing RoCE devices are connected by lines showing the flow of RDMA data.

In the RoCE Traffic Topology GUI view, the flow unfolds dynamically, revealing the interconnected pathways of RoCE traffic. Nodes representing devices engaged in RoCE communication are linked by lines indicating the data exchange routes. The graphical representation allows for an intuitive understanding of the network’s structure, emphasizing the direct, low-latency pathways characteristic of RoCE.

Figure 2: RoCE Enabled Interfaces. Network landscape with PFC-enabled interfaces; blue dots represent RoCE-capable interfaces.

The graphical user interface (Figure 2) shows the network landscape with its PFC-enabled interfaces, highlighting where RoCE capabilities are integrated. Interfaces marked with a blue dot are capable of transporting RoCE traffic.

Figure 3 depicts various provisions facilitating RoCE support on a device. In this case, the device is handling L3 lossless traffic on queues 3 and 4 of interface number 51.

Figure 3: QoS Configuration for RoCE support on a device. The device handles L3 lossless traffic on queues 3 and 4 of interface 51.

Figure 4 shows how ONES presents the distribution of RoCE traffic alongside regular traffic on an interface. It also shows lossless data being transmitted even under congestion, along with the count of pause frames sent and received by the device.

Figure 4: RoCE Traffic Segregation & PFC Counters. Visualizes RoCE and regular traffic distribution, with lossless data transmission maintained in congested conditions.

Queue drop counters play a pivotal role in AI Fabrics, offering crucial insights into the network’s performance and reliability. These counters track instances where packets are dropped within the queuing system, providing valuable data for monitoring and optimization.
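
One simple way to put queue drop counters to work is to compute a drop ratio per queue over a polling interval and flag any queue that exceeds a budget. The sketch below assumes the per-interval transmitted and dropped packet counts are already available; the queue labels and the 0.1% budget are hypothetical values chosen for illustration, not ONES defaults.

```python
def drop_ratio(tx_packets, dropped_packets):
    """Fraction of packets offered to a queue that were dropped."""
    offered = tx_packets + dropped_packets
    return dropped_packets / offered if offered else 0.0


def queues_over_budget(samples, budget=0.001):
    """Return (interface, queue) keys whose drop ratio exceeds the budget.

    samples maps (interface, queue) to (tx_packets, dropped_packets)
    counted over one polling interval.
    """
    return sorted(
        key for key, (tx, dropped) in samples.items()
        if drop_ratio(tx, dropped) > budget
    )


# Hypothetical interval counts: queue 3 is clean, queue 0 is dropping 1%.
samples = {
    ("Ethernet51", 3): (1_000_000, 0),
    ("Ethernet51", 0): (990_000, 10_000),
}
offenders = queues_over_budget(samples)
print(offenders)
```

For queues meant to carry lossless RoCE traffic, an operator would likely set the budget to zero, since any drop on such a queue suggests that flow control is not behaving as intended.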

Figure 5: QoS Drop Counters. Monitors network performance and reliability by tracking dropped packets.

Conclusion

Based on the presented GUI snapshots, it’s evident that ONES offers a captivating visual experience, showcasing intricately designed software crafted explicitly for the AI Fabric on the SONiC platform. ONES doesn’t just fulfill the requirements of contemporary networking; it also enhances user interaction through intuitive visualization and advanced features. This platform signifies an innovative approach to orchestrating and visualizing networks across multiple vendors, delivering a customized solution for addressing the intricate nature of AI Fabric on the SONiC platform.

Coming up in our forthcoming blog series, we’ll explore these topics in depth:

  • Detailed security compliance with ONES
  • In-depth analysis regarding the measurement of NWSLA

To experience SONiC firsthand, visit ONES Center. For a comprehensive SONiC case study, check out “Maximizing Success with SONiC”.
