Transforming AI Fabric with ONES: Enhanced Observability for GPU Performance

November 6, 2024

Explore the latest in AI network management with our ONES 3.0 series

Future of Intelligent Networking for AI Fabric Optimization

If you’re operating a high-performance data center or managing AI/ML workloads, ONES 3.0 offers advanced features that ensure your network remains optimized and congestion-free, with lossless data transmission as a core priority.

Network infrastructure must keep pace with the growing demands of high-performance computing, real-time data processing, and communication between compute nodes. As organizations build increasingly complex AI models, low-latency, lossless data transmission and careful scheduling of network traffic have become crucial. ONES 3.0 is designed to address these requirements with tools for managing AI fabrics precisely and at scale.

Building on the solid foundation laid by ONES 2.0, where RoCE (RDMA over Converged Ethernet) support enabled lossless communication and enhanced proactive congestion management, ONES 3.0 takes these capabilities to the next level. We’ve further improved RoCE features with the introduction of PFC Watchdog (PFCWD) for enhanced fault tolerance, Scheduler for optimized traffic handling, and WRED for intelligent queue management, ensuring that AI workloads remain highly efficient and resilient, even in the most demanding environments.

Why RoCE is Critical for Building AI Models

The next generation of AI models requires vast amounts of data to be transferred quickly and reliably across nodes, which makes RoCE an indispensable technology. By enabling remote direct memory access (RDMA) over Ethernet, RoCE facilitates low-latency, high-throughput, and lossless data transmission, all critical elements in building and training modern AI models.

In AI workloads, scheduling data packets effectively ensures that model training is not delayed due to network congestion or packet loss. RoCE’s ability to prioritize traffic and ensure lossless data movement allows AI models to operate at optimal speeds, making it a perfect fit for today’s AI infrastructures. Whether it’s transferring large datasets between GPU clusters or ensuring smooth communication between nodes in a distributed AI system, RoCE ensures that critical data flows seamlessly without compromising performance.

Enhancing RoCE Capabilities from ONES 2.0 to ONES 3.0

In ONES 3.0, we’ve taken RoCE management even further, enhancing the ability to monitor and optimize Priority Flow Control (PFC) and ensuring lossless RDMA traffic under heavy network loads. The new PFC Watchdog (PFCWD) detects misconfigurations and failures in flow control in real time so they can be addressed before they cause traffic stalls or congestion collapse in AI-driven environments.
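To make the watchdog idea concrete, here is a minimal sketch of a per-queue state machine that declares a storm after a queue has been paused continuously for a detection time and recovers after it has stayed pause-free for a restoration time. The class, field names, and timer values are assumptions for illustration and are not taken from ONES.

```python
from dataclasses import dataclass
from enum import Enum


class QueueState(Enum):
    OPERATIONAL = "operational"
    STORMED = "stormed"


@dataclass
class PfcWatchdog:
    """Toy per-queue watchdog; all names and defaults are illustrative."""
    detection_time_ms: int = 200     # continuous pause needed to declare a storm
    restoration_time_ms: int = 400   # continuous pause-free time needed to recover
    state: QueueState = QueueState.OPERATIONAL
    _paused_ms: int = 0
    _clear_ms: int = 0

    def poll(self, paused: bool, interval_ms: int) -> QueueState:
        """Feed one polling sample: was the queue paused for this whole interval?"""
        if paused:
            self._paused_ms += interval_ms
            self._clear_ms = 0
        else:
            self._clear_ms += interval_ms
            self._paused_ms = 0

        if (self.state is QueueState.OPERATIONAL
                and self._paused_ms >= self.detection_time_ms):
            # A real device would now apply a mitigation action on this queue.
            self.state = QueueState.STORMED
        elif (self.state is QueueState.STORMED
                and self._clear_ms >= self.restoration_time_ms):
            # The pause condition has cleared for long enough; resume normally.
            self.state = QueueState.OPERATIONAL
        return self.state
```

Calling `poll` once per polling interval with the observed pause status is enough to drive the state machine; an observability tool would then surface the transitions as events.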

Additionally, ONES 3.0’s Scheduler allows for more sophisticated data packet scheduling, ensuring that AI tasks are executed with precision and efficiency. Combined with WRED (Weighted Random Early Detection), which intelligently manages queue drops to prevent buffer overflow in congested networks, ONES 3.0 provides a holistic solution for RoCE-enabled AI fabrics.
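As a rough illustration of how WRED decides when to drop, the sketch below uses the classic RED formulation: the average queue depth is smoothed, and the drop probability rises linearly between a minimum and a maximum threshold. The function names, parameters, and default weight are assumptions, not ONES or switch settings.

```python
def smoothed_queue_depth(prev_avg: float, sample: float, weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the instantaneous queue depth."""
    return (1.0 - weight) * prev_avg + weight * sample


def wred_drop_probability(avg_queue: float, min_th: float,
                          max_th: float, max_p: float) -> float:
    """Drop probability for one WRED profile.

    Below min_th nothing is dropped, at or above max_th everything is
    dropped, and in between the probability ramps linearly up to max_p.
    With ECN enabled, a device typically marks packets instead of dropping.
    """
    if min_th >= max_th:
        raise ValueError("min_th must be smaller than max_th")
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```

The "weighted" part of WRED comes from giving different traffic classes or drop precedences their own `min_th`, `max_th`, and `max_p` values, so less important traffic starts dropping earlier.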

The Importance of QoS and RoCE in AI Networks

Quality of Service (QoS) and RoCE are pivotal in ensuring that AI networks can handle the rigorous demands of real-time processing and massive data exchanges without performance degradation. In environments where AI workloads must process large amounts of data between nodes, QoS ensures that critical tasks receive the required bandwidth, while RoCE ensures that this data is transmitted with minimal latency and no packet loss.

With AI workloads demanding real-time responsiveness, any network inefficiency or congestion can slow down AI model training, leading to delays and sub-optimal performance. The advanced QoS mechanisms in ONES 3.0, combined with enhanced RoCE features, provide the necessary tools to prioritize traffic, monitor congestion, and optimize the network for the low-latency, high-reliability communication that AI models depend on.

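In ONES 3.0, QoS features such as DSCP mapping, WRED, and scheduling profiles allow customers to classify traffic accurately, configure priority queues for mission-critical packets, and keep AI data transmission lossless under congestion.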

By leveraging QoS in combination with RoCE, ONES 3.0 creates an optimized environment for AI networks, allowing customers to confidently build and train next-generation AI models without worrying about data bottlenecks.
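The DSCP mapping mentioned above can be pictured as a simple lookup from a packet's DSCP value to a traffic class (TC). The sketch below is only an illustration: the DSCP values, traffic class numbers, and the lossless set are hypothetical and do not come from a ONES configuration.

```python
# Illustrative values only: DSCP 26 for RoCE data and DSCP 48 for congestion
# notification packets are assumptions, not settings taken from ONES.
DSCP_TO_TC = {26: 3, 48: 6}
DEFAULT_TC = 0          # everything unmapped falls into best effort
LOSSLESS_TCS = {3}      # traffic classes assumed to have PFC enabled


def classify(dscp: int) -> int:
    """Return the traffic class for a 6-bit DSCP value."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return DSCP_TO_TC.get(dscp, DEFAULT_TC)


def is_lossless(dscp: int) -> bool:
    """True if packets with this DSCP land in a PFC-protected class."""
    return classify(dscp) in LOSSLESS_TCS
```

A mismatch between this kind of map on two neighboring switches is a typical misconfiguration that a centralized view makes easier to spot.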

1. Comprehensive Interface and Performance Metrics

The UI showcases essential network performance indicators such as In/Out packet rates, errors, and discards, all displayed in real time.

By having access to real-time and historical data, customers can make data-driven decisions to optimize network performance without sacrificing the quality of their AI workloads.
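Packet rates like the ones described here are usually derived from monotonically increasing interface counters. As a minimal sketch, assuming two counter samples taken a known number of seconds apart (the function and parameter names are hypothetical):

```python
def rate_per_second(prev_count: int, curr_count: int,
                    interval_s: float, counter_bits: int = 64) -> float:
    """Average per-second rate between two samples of a wrapping counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # The modulo keeps the delta correct if the counter wrapped once
    # between the two samples.
    delta = (curr_count - prev_count) % (1 << counter_bits)
    return delta / interval_s
```

The same calculation applies to error and discard counters, which is what makes a sudden change in their rate visible on a graph.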

2. RoCE Config Visualization

RoCE (RDMA over Converged Ethernet) is a key technology used to achieve high-throughput and low-latency communication, especially when training AI models, where data packets must flow without loss. In ONES 3.0, the RoCE tab within the UI shows how data traffic is managed, including DSCP and 802.1p mappings to queues and priority groups, WRED and PFC statistics, and scheduler profiles.

3. Visual Traffic Monitoring: A Data-Driven Experience

The UI doesn’t just give you raw data; it helps you visualize it. With multiple graphing options and real-time statistics, customers can easily monitor metrics such as packet drops, queue usage, and PFC status.

4. Flexible Time-Based Monitoring and Analysis

Customers can track metrics over various time periods, from live updates (1 hour) to historical views (12 hours, 2 weeks, etc.).

This feature is especially valuable for customers running AI workloads, where consistent performance over extended periods is vital for the accuracy and efficiency of model training.
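One way to think about these time ranges is as a filter over timestamped samples followed by a summary. The sketch below assumes samples arrive as `(timestamp, value)` pairs; the window labels mirror the periods mentioned above, and everything else is a hypothetical illustration.

```python
from datetime import datetime, timedelta

# Window labels follow the periods named in the text.
WINDOWS = {
    "1h": timedelta(hours=1),
    "12h": timedelta(hours=12),
    "2w": timedelta(weeks=2),
}


def summarize(samples, window: str, now: datetime):
    """Return min, max, and average of the samples inside the chosen window."""
    cutoff = now - WINDOWS[window]
    values = [value for ts, value in samples if cutoff <= ts <= now]
    if not values:
        return None
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }
```

Comparing the summary of a short window against a long one is a simple way to tell a momentary spike from a sustained shift.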

Centralized QoS View

ONES 3.0 offers a unified interface for all QoS configurations, including DSCP to TC mappings, WRED/ECN, and scheduler profiles, making traffic management simpler for network admins.

This page provides administrators with comprehensive insights into how traffic flows through the network, allowing them to fine-tune and optimize their configurations to meet the unique demands of modern workloads.

Fig 1 – QoS Profile List

Comprehensive Topology View

ONES offers a comprehensive, interactive map of network devices and their connectivity, ideal for monitoring AI/ML and RoCE environments. It provides an actionable overview that simplifies network management.

Fig 2 – AI-ML Topology View

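Key features include end-to-end visibility of switches, servers, and GPU nodes, instant health status and link performance, and rapid isolation of failures such as faulty fans or down ports.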

Overall, the Topology Page in ONES enhances network observability and control, making it easier to optimize performance, troubleshoot issues, and ensure the smooth operation of AI/ML and RoCE workloads.
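A topology view is, underneath, a graph of devices and links. As a minimal sketch of how such a graph can help isolate a failure, the function below finds which devices are still reachable from a starting node when some links are down. The device names and data layout are assumptions for illustration.

```python
from collections import defaultdict, deque


def reachable(links, start, down_links=frozenset()):
    """Breadth-first search over the links that are still up."""
    adjacency = defaultdict(set)
    for a, b in links:
        if (a, b) in down_links or (b, a) in down_links:
            continue  # skip failed links in either direction
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Subtracting the result from the full device set gives the nodes cut off by the failure, for example a GPU server behind a down port.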

Proactive Monitoring and Alerts with the Enhanced ONES Rule Engine

The ONES Rule Engine has been a standout feature in previous releases, providing robust monitoring and alerting capabilities for network administrators. With the latest update, we’ve enhanced the usability and functionality, making rule creation and alert configuration even smoother and more intuitive. Whether monitoring RoCE metrics or AI-Fabric performance counters, administrators can now set up alerts with greater precision and ease. This new streamlined experience allows for better anomaly detection, helping prevent network congestion and data loss before they impact performance.

The ONES Rule Engine supports proactive network management through real-time anomaly detection and alerting. It provides deep visibility into AI-Fabric metrics such as queue counters, PFC events, packet rates, and link failures, which helps keep RoCE-based applications performing smoothly. Because users can set custom thresholds and conditions for congestion detection, network administrators can address potential bottlenecks before they escalate.

With integrated alerting systems such as Slack and Zendesk, administrators can respond instantly to network anomalies. The ONES Rule Engine’s automation streamlines monitoring and troubleshooting, helping prevent data loss and maintain optimal network conditions, ultimately enhancing the overall network efficiency.
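To illustrate what a threshold rule with a condition might look like, here is a small sketch that fires once when a metric has breached its threshold for a given number of consecutive samples. The `Rule` class, its fields, and the metric name are hypothetical and do not reflect the ONES Rule Engine's actual schema or API.

```python
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class Rule:
    """Hypothetical threshold rule; not the ONES rule format."""
    name: str
    metric: str
    op: str
    threshold: float
    for_samples: int = 1   # consecutive breaches required before alerting
    _streak: int = 0

    def evaluate(self, sample: dict) -> bool:
        """Return True exactly once, when the breach streak reaches for_samples."""
        value = sample.get(self.metric)
        if value is not None and OPS[self.op](value, self.threshold):
            self._streak += 1
        else:
            self._streak = 0
        return self._streak == self.for_samples
```

In a real deployment the `True` result would be handed to a notifier for a channel such as Slack or Zendesk; requiring several consecutive breaches is a common way to avoid alerting on a single noisy sample.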

Conclusion

In an era where AI and machine learning are driving transformative innovations, the need for a robust and efficient network infrastructure has never been more critical. ONES 3.0 ensures that AI workloads can operate seamlessly, with minimal latency and no packet loss.

FAQs

1. Why is RoCE critical for AI infrastructure and model training?

RoCE (RDMA over Converged Ethernet) is essential for AI because it enables:

Low-latency, high-throughput data transfers between GPU nodes
Lossless communication, vital for real-time model training
Efficient memory access without CPU involvement

This makes RoCE a foundational technology for building and scaling AI/ML workloads.

2. How does ONES 3.0 improve RoCE management and observability?

ONES 3.0 advances RoCE integration through:

PFC Watchdog (PFCWD) for monitoring and recovering from flow control issues
Advanced scheduling tools (DWRR, WRR, STRICT) to manage packet priorities
WRED-based queue management to prevent buffer overflows

These features ensure network reliability, even under high AI traffic loads.
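Of the scheduling modes listed above, DWRR (deficit weighted round robin) is the least obvious, so here is a minimal sketch of the textbook algorithm. Each queue earns a quantum of bytes per round and may send packets as long as its accumulated deficit covers the packet at the head of the queue. The queue names and quanta are illustrative assumptions.

```python
from collections import deque


def dwrr_schedule(queues, quanta, rounds):
    """Serve queues of packet sizes with deficit weighted round robin.

    queues: dict of queue name -> deque of packet sizes in bytes
    quanta: dict of queue name -> bytes credited to that queue per round
    Returns the transmit order as a list of (queue name, packet size).
    """
    deficits = {name: 0 for name in queues}
    sent = []
    for _ in range(rounds):
        for name, packets in queues.items():
            if not packets:
                deficits[name] = 0   # idle queues do not accumulate credit
                continue
            deficits[name] += quanta[name]
            while packets and packets[0] <= deficits[name]:
                size = packets.popleft()
                deficits[name] -= size
                sent.append((name, size))
            if not packets:
                deficits[name] = 0
    return sent
```

With a quantum three times larger, a queue receives roughly three times the bandwidth of its neighbor under contention, while a strict-priority queue would instead be drained before any DWRR queue is served.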

3. What QoS features are included in ONES 3.0 for optimizing AI network traffic?

Quality of Service (QoS) is crucial for prioritizing AI tasks. ONES 3.0 includes:

DSCP and dot1p mapping for accurate traffic classification
Priority queue configuration to handle mission-critical packets
Real-time congestion alerts and traffic shaping for lossless AI data transmission

Together, these ensure uninterrupted, high-performance AI workloads.

4. How does ONES 3.0 visualize and monitor RoCE configurations in real time?

The ONES UI provides deep visibility into:

DSCP and 802.1p mapping to queues and priority groups
WRED and PFC stats for congestion handling
Scheduler profiles and queue usage across switches

This empowers network admins to proactively tune RoCE traffic and avoid disruptions.

5. What role does the ONES Rule Engine play in maintaining AI network performance?

The enhanced ONES Rule Engine enables proactive, automated management through:

Custom alert rules for RoCE, queue drops, and link failures
Slack/Zendesk integration for instant anomaly notifications
Granular threshold settings to prevent issues before they affect AI training

It turns ONES into an intelligent observability and incident response system.

6. How does advanced network observability help maintain lossless traffic for AI workloads?

Advanced network observability:

Provides live metrics on packet drops, queue usage, and PFC status
Visualizes congestion points across GPU clusters
Correlates anomalies with traffic spikes for faster mitigation
Ensures AI data pipelines stay lossless under peak loads

7. Can an AI network assistant automate congestion handling for RoCE traffic?

Yes. An AI network assistant can:

Monitor PFC and ECN metrics continuously
Auto-tune scheduling weights based on real-time load
Trigger alerts or pause flows before congestion impacts GPU compute jobs
Maintain smooth RoCE packet flow with minimal manual intervention

8. Why is real-time queue monitoring important in GPU-centric AI networks?

In GPU-heavy environments:

Queue buildup signals impending congestion, risking packet drops
Monitoring helps tune WRED thresholds proactively
Admins can rebalance workloads to prevent data stalls

This keeps model training and inferencing pipelines running at optimal speed.

9. How do centralized QoS views benefit network operations teams?

A centralized QoS dashboard enables teams to:

Review all DSCP, PFC, and WRED mappings in one place
Detect misconfigurations quickly
Compare live queue stats across multiple switches
Fine-tune policies without logging into each device separately

10. How does topology visualization support AI fabric reliability?

An interactive topology map offers:

End-to-end visibility of switches, servers, and GPU nodes
Instant health status and link performance
Rapid isolation of failures (e.g., faulty fans, down ports)
Better understanding of data flows, helping admins optimize AI cluster connectivity

Neekshitha Dyasani

Blog Author
