
ONES Rule Engine: Enhanced Monitoring and Alerting for AI-Fabric

The ONES Rule Engine adds an integrated alert and notification system to your network management workflow. It delivers detailed monitoring metrics and lets you create rules at both the device and interface levels. The latest update extends the Rule Engine to AI-Fabric metrics such as queue counters, PFC, traffic rates, and link and node failures. This gives administrators better visibility into network performance, helps them pinpoint potential issues, and lets them proactively maintain optimal conditions for RoCE-based applications and workloads.

Anomaly Detection and Alerting on AI-Fabric

The following AI-Fabric counters help data center operators (DCOs) identify and prevent network congestion and data loss.

Queue Counters

Performance Counters

ONES 2.1 Rule Engine for Anomalies & Alerting

PFC Receive and Transmit Counters

Priority Flow Control (PFC) is a mechanism that prevents frame loss due to congestion. It operates by sending priority pause frames (per traffic class) to the sender when buffer thresholds are exceeded. The PFC counters record the number of priority pause frames sent and received by the device. By actively monitoring these counters, the ONES rule engine can alert data center operators and administrators to potential congestion and hotspots. Customers can set their desired congestion threshold for alerting using the various attributes available:

Figure 1: Rule Configuration - Interface PFC receive counters

When the rule conditions are met, the ONES rule engine dispatches alerts through configured channels such as Slack and Zendesk, as well as on the Watcher – Alerts page. Each alert carries the essential information, including details on the device, interface, and queue.
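
To make the threshold idea concrete, here is a minimal sketch of how a rule on PFC receive counters could be evaluated between two polling cycles. It is not the ONES implementation: the `PfcSample` type, the `pfc_alerts` function, and the field names are all hypothetical and chosen only for illustration. It assumes the counters are cumulative, so the rule compares the increase between polls against the threshold.

```python
from dataclasses import dataclass


@dataclass
class PfcSample:
    """One cumulative PFC receive counter reading (hypothetical shape)."""
    device: str
    interface: str
    priority: int
    rx_pause_frames: int


def pfc_alerts(previous, current, threshold):
    """Return one alert dict per (device, interface, priority) whose PFC
    receive counter grew by more than `threshold` between two polls."""
    last = {(s.device, s.interface, s.priority): s.rx_pause_frames
            for s in previous}
    alerts = []
    for s in current:
        key = (s.device, s.interface, s.priority)
        if key not in last:
            continue  # no baseline yet for this counter
        delta = s.rx_pause_frames - last[key]
        if delta < 0:
            continue  # counter was reset; wait for the next poll
        if delta > threshold:
            alerts.append({
                "device": s.device,
                "interface": s.interface,
                "priority": s.priority,
                "pfc_rx_delta": delta,
            })
    return alerts


previous = [PfcSample("leaf1", "Ethernet0", 3, 100),
            PfcSample("leaf1", "Ethernet4", 3, 50)]
current = [PfcSample("leaf1", "Ethernet0", 3, 700),
           PfcSample("leaf1", "Ethernet4", 3, 60)]
alerts = pfc_alerts(previous, current, threshold=500)
```

With these sample readings, only `Ethernet0` crosses the threshold, so a single alert is produced. A real deployment would hand that alert to a notification channel rather than keep it in a list.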

Queue Drop Counters

When setting up a network for lossless applications such as RoCE, it’s crucial to also monitor flows that may become lossy. Egress queue drop counters are vital for identifying congestion and traffic drops on outbound ports. Analyzing these egress queue drops helps customers troubleshoot network congestion and resolve performance issues. Furthermore, queue drop counters can be activated at the device level, allowing for an overall assessment of queue drops across all queues on every interface of a device.

Figure 2: Rule for Queue drops

Figure 3: Filter options for interface Queue Drop counters rule
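
The device-level view described above amounts to summing drops across every queue on every interface of a device. The sketch below illustrates that aggregation under stated assumptions: the tuple layout `(device, interface, queue, dropped_packets)` and the function names are hypothetical, not part of the ONES product.

```python
from collections import defaultdict


def device_queue_drops(samples):
    """Sum egress queue drops per device.

    `samples` is an iterable of (device, interface, queue, dropped_packets)
    tuples; this layout is assumed for illustration only.
    """
    totals = defaultdict(int)
    for device, _interface, _queue, dropped in samples:
        totals[device] += dropped
    return dict(totals)


def devices_over_threshold(samples, threshold):
    """Return the sorted names of devices whose total drops exceed `threshold`."""
    totals = device_queue_drops(samples)
    return sorted(d for d, total in totals.items() if total > threshold)


samples = [
    ("leaf1", "Ethernet0", 3, 40),
    ("leaf1", "Ethernet4", 4, 70),
    ("leaf2", "Ethernet0", 3, 5),
]
totals = device_queue_drops(samples)
noisy = devices_over_threshold(samples, threshold=100)
```

An interface-level rule would use the same pattern with `(device, interface)` as the grouping key instead of `device` alone.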

Failure Detection: Link Flap

Link failure is another critical metric that needs to be monitored in AI-Fabric. Bad links caused by improper cabling or transceivers can significantly affect the lossless requirement for RoCE traffic. It is critical to raise alerts and take corrective action to avoid traffic loss and performance degradation. Corrective actions could include replacing bad optics or adjusting control plane policies to re-route traffic towards a better path. The ONES rule engine continuously monitors links over a specific interval and automatically creates alerts with the necessary payload, including device and optics details, device location, and layer. This gives DCOs the data they need to take corrective action.
Figure 4: Alert payload – link down
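
As a rough illustration of monitoring a link over an interval, the sketch below counts up-to-down transitions inside a time window and returns an alert payload when the count exceeds a limit. The event format, the function names, and the payload fields are assumptions made for this example; the actual ONES alert payload is the one shown in Figure 4.

```python
def count_flaps(events, window_start, window_end):
    """Count up-to-down transitions inside [window_start, window_end].

    `events` is a time-sorted list of (timestamp_seconds, state) tuples,
    where state is "up" or "down" (an assumed format).
    """
    flaps = 0
    prev_state = None
    for ts, state in events:
        if ts > window_end:
            break
        if state == "down" and prev_state == "up" and ts >= window_start:
            flaps += 1
        prev_state = state
    return flaps


def link_flap_alert(device, interface, events, window_start, window_end, max_flaps):
    """Return a hypothetical alert payload if the link flapped more than
    `max_flaps` times in the window, otherwise None."""
    flaps = count_flaps(events, window_start, window_end)
    if flaps > max_flaps:
        return {"device": device, "interface": interface, "flaps": flaps}
    return None


events = [(0, "up"), (10, "down"), (12, "up"), (30, "down"),
          (31, "up"), (50, "down"), (52, "up")]
alert = link_flap_alert("leaf1", "Ethernet0", events, 0, 60, max_flaps=2)
```

Here the link goes down three times within the 60-second window, which exceeds the limit of two, so an alert is returned. Counting transitions rather than sampling the current state keeps short flaps from being missed between polls.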