
ONES Rule Engine: Enhanced Monitoring and Alerting for AI-Fabric

May 7, 2024

The ONES Rule Engine is a sophisticated feature that enhances your network management capabilities by incorporating an integrated alert and notification system. It delivers detailed monitoring metrics and facilitates easy creation of rules at both device and interface levels. The latest update to the ONES Rule Engine has broadened its capabilities to monitor AI-Fabric metrics such as queue counters, PFC, traffic rates, and link and node failures. This enhancement allows administrators to achieve better visibility into network performance, pinpoint potential issues, and proactively maintain optimal conditions for RoCE-based applications and workloads.

Anomaly Detection and Alerting on AI-Fabric

The following AI-Fabric counters help data center operators (DCOs) identify and prevent network congestion and data loss.

Queue Counters

Performance Counters

ONES 2.1 Rule Engine for Anomalies & Alerting

PFC Receive and Transmit Counters

Priority Flow Control (PFC) is a mechanism that prevents frame loss due to congestion. It operates by sending priority pause frames (per traffic class) to the sender when buffer thresholds are exceeded. The count of priority pause frames sent and received by the device is available in the PFC counters. By actively monitoring PFC counters, the ONES Rule Engine can alert data center operators and administrators to potential congestion and hotspots. Customers can set their desired congestion threshold for alerting using the various attributes available:

Figure 1: Rule Configuration - Interface PFC receive counters

When conditions are met, the ONES rule engine dispatches alerts through configured channels such as Slack and Zendesk, as well as on the Watcher – Alerts page. These alerts provide essential information about the generated alert, including details on the device, interface, and queue.
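To make the threshold idea concrete, here is a minimal Python sketch of how a PFC receive-counter rule could be evaluated between two polling cycles. It is not the ONES implementation or API: the `PfcSample` type, the `pfc_alerts` function, and the alert field names are all hypothetical, and the sketch assumes the counters are cumulative, so the rule compares the growth between polls against the configured threshold.

```python
from dataclasses import dataclass


@dataclass
class PfcSample:
    """One polled reading of a cumulative PFC receive counter (hypothetical shape)."""
    device: str
    interface: str
    priority: int          # traffic class / PFC priority
    rx_pause_frames: int   # cumulative count of received priority pause frames


def pfc_alerts(previous, current, threshold):
    """Return one alert dict per (device, interface, priority) whose PFC
    receive counter grew by more than `threshold` between two polls."""
    prev_by_key = {
        (s.device, s.interface, s.priority): s.rx_pause_frames for s in previous
    }
    alerts = []
    for s in current:
        key = (s.device, s.interface, s.priority)
        if key not in prev_by_key:
            continue  # no baseline yet for this queue, so nothing to compare
        delta = s.rx_pause_frames - prev_by_key[key]
        if delta > threshold:
            alerts.append({
                "device": s.device,
                "interface": s.interface,
                "priority": s.priority,
                "delta": delta,
                "threshold": threshold,
            })
    return alerts


previous = [PfcSample("leaf1", "Ethernet0", 3, 100)]
current = [PfcSample("leaf1", "Ethernet0", 3, 700)]
for alert in pfc_alerts(previous, current, threshold=500):
    print(alert)
```

A counter that goes backwards (for example after a device reboot) produces a negative delta and is simply ignored here. A production rule engine would likely handle resets explicitly.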

Queue Drop Counters:

When setting up a network for lossless applications such as RoCE, it’s crucial to also monitor flows that may become lossy. Egress queue drop counters are vital for identifying congestion and traffic drops on outbound ports. Analyzing these egress queue drops helps customers troubleshoot network congestion and resolve performance issues. Furthermore, queue drop counters can be activated at the device level, allowing for an overall assessment of queue drops across all queues on every interface of a device.

Figure 2: Rule for Queue drops

Figure 3: Filter options for interface Queue Drop counters rule
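The device-level view described above amounts to summing drops across every queue on every interface of a device. The short Python sketch below illustrates that aggregation. The tuple layout, the function names, and the threshold semantics are assumptions made for illustration, not the ONES data model.

```python
from collections import defaultdict


def device_queue_drops(samples):
    """Sum dropped packets per device.

    `samples` is an iterable of (device, interface, queue, dropped_packets)
    tuples, a hypothetical shape chosen for this sketch.
    """
    totals = defaultdict(int)
    for device, _interface, _queue, dropped in samples:
        totals[device] += dropped
    return dict(totals)


def devices_over_threshold(samples, threshold):
    """Return the sorted names of devices whose total queue drops exceed `threshold`."""
    totals = device_queue_drops(samples)
    return sorted(device for device, total in totals.items() if total > threshold)


samples = [
    ("leaf1", "Ethernet0", 3, 120),
    ("leaf1", "Ethernet4", 3, 30),
    ("leaf2", "Ethernet0", 4, 10),
]
print(devices_over_threshold(samples, threshold=100))
```

An interface-level rule would group by (device, interface, queue) instead, which is what the filter options in the rule configuration narrow down.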

Failure Detection: Link Flap

Link failure is another critical metric that needs to be monitored in AI-Fabric. Bad links due to improper cabling or transceivers can significantly affect the lossless requirement for RoCE traffic. It is critical to alert and take corrective action to avoid traffic loss and performance degradation. Corrective actions could include replacing bad optics or adjusting the control plane policies to re-route traffic towards a better path. The ONES Rule Engine continuously monitors links over a specific interval and automatically creates alerts with the necessary payload, including the device, optics information, device location, layer, and other details. This gives DCOs all the data they need to take corrective action.

Figure 4: Alert payload – link down
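As a rough illustration of monitoring a link over an interval, the Python sketch below counts up-to-down transitions inside a time window and builds an alert payload when the count exceeds a limit. The event format, the function names, and the payload fields are hypothetical. The real payload is the one shown in Figure 4.

```python
def count_flaps(events, window_start, window_end):
    """Count up->down transitions whose timestamp falls inside the window.

    `events` is a list of (timestamp_seconds, state) tuples sorted by time,
    where state is "up" or "down" (an assumed format for this sketch).
    """
    flaps = 0
    prev_state = None
    for timestamp, state in events:
        went_down = prev_state == "up" and state == "down"
        if went_down and window_start <= timestamp <= window_end:
            flaps += 1
        prev_state = state
    return flaps


def link_flap_alert(device, interface, events, window_start, window_end, max_flaps):
    """Return an alert payload if the link flapped more than `max_flaps`
    times in the window, otherwise None."""
    flaps = count_flaps(events, window_start, window_end)
    if flaps <= max_flaps:
        return None
    return {
        "device": device,
        "interface": interface,
        "flaps": flaps,
        "window_seconds": window_end - window_start,
    }


events = [(0, "up"), (10, "down"), (12, "up"), (20, "down"), (25, "up")]
print(link_flap_alert("spine1", "Ethernet8", events, 0, 60, max_flaps=1))
```

Counting only the up-to-down edge avoids double counting a single flap, since each flap consists of one down event followed by one up event.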

FAQs

1. What is the ONES Rule Engine and how does it enhance AI-Fabric monitoring?

The ONES Rule Engine is an integrated alert and notification system that provides detailed monitoring at the device and interface levels. It tracks critical AI-Fabric metrics like queue counters, PFC events, traffic rates, and link failures, enabling proactive detection of congestion, anomalies, and potential RoCE performance issues.

ONES Rule Engine monitors key AI-Fabric counters such as packet transmit and receive rates, dropped packets, ECN-marked packets, PFC transmit/receive counters, queue drop rates, and link flap failures, providing complete visibility into traffic performance, congestion points, and hardware health.

By continuously tracking PFC events, queue drops, and ECN-marked packets, the ONES Rule Engine can generate real-time alerts when congestion thresholds are breached. This allows data center operators to take immediate corrective actions, ensuring lossless traffic flow essential for RoCE workloads.

Yes, when predefined conditions are met, ONES Rule Engine can dispatch real-time alerts through channels like Slack, Zendesk, and its internal Watcher-Alerts page. Each alert includes critical details about the device, queue, and interface to expedite troubleshooting and resolution.

ONES Rule Engine continuously monitors links for instability or flaps, especially critical in RoCE-based fabrics. Upon detecting issues, it automatically generates alerts with full payload details (device, optics, location, layer), enabling rapid corrective actions like optical replacements or traffic rerouting.
