
ONES Rule Engine: Enhanced Monitoring and Alerting for AI-Fabric

The ONES Rule Engine is a sophisticated feature that enhances your network management capabilities by incorporating an integrated alert and notification system. It delivers detailed monitoring metrics and facilitates easy creation of rules at both device and interface levels. The latest update to the ONES Rule Engine has broadened its capabilities to monitor AI-Fabric metrics such as queue counters, PFC, traffic rates, and link and node failures. This enhancement allows administrators to achieve better visibility into network performance, pinpoint potential issues, and proactively maintain optimal conditions for RoCE-based applications and workloads.

Anomaly Detection and Alerting on AI-Fabric

The following AI-Fabric counters help data center operators (DCOs) identify and prevent network congestion and data loss.

Queue Counters

Performance Counters

ONES 2.1 Rule Engine for Anomalies & Alerting

PFC Receive and Transmit Counters

Priority Flow Control (PFC) is a mechanism that prevents frame loss due to congestion. When buffer thresholds are exceeded, the device sends priority pause frames (per traffic class) to the sender. The PFC counters record the number of priority pause frames sent and received by the device. By actively monitoring these counters, the ONES Rule Engine can alert data center operators and administrators to potential congestion and hotspots. Customers can set their desired congestion threshold for alerting using the various attributes available:

Figure 1: Rule Configuration - Interface PFC receive counters

When the rule conditions are met, the ONES Rule Engine dispatches alerts through configured channels such as Slack and Zendesk, as well as on the Watcher – Alerts page. Each alert carries essential information, including details on the device, interface, and queue.
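The threshold logic described above can be pictured with a minimal Python sketch. It is not the ONES implementation or API: the `PfcSample` type, the `pfc_alerts` function, and the field names are hypothetical, and the sketch assumes the counters are cumulative, so the rule compares the growth between two polls against a user-chosen threshold.

```python
from dataclasses import dataclass


@dataclass
class PfcSample:
    """One poll of a cumulative PFC receive counter (hypothetical shape)."""
    device: str
    interface: str
    queue: int             # priority queue the pause frames apply to
    rx_pause_frames: int   # cumulative count since the counter was last reset


def pfc_alerts(previous, current, threshold):
    """Return one alert per (device, interface, queue) whose pause-frame
    count grew by more than `threshold` between two polls."""
    last = {(s.device, s.interface, s.queue): s.rx_pause_frames for s in previous}
    alerts = []
    for s in current:
        key = (s.device, s.interface, s.queue)
        if key not in last:
            continue  # no baseline yet for this queue
        delta = s.rx_pause_frames - last[key]
        if delta < 0:
            continue  # counter was reset; skip this interval
        if delta > threshold:
            alerts.append({
                "device": s.device,
                "interface": s.interface,
                "queue": s.queue,
                "pause_frames_in_interval": delta,
            })
    return alerts
```

Each returned dictionary mirrors the kind of detail the alert is said to carry (device, interface, and queue); how the real alert is formatted and delivered to Slack or Zendesk is not shown here.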

Queue Drop Counters

When setting up a network for lossless applications such as RoCE, it’s crucial to also monitor flows that may become lossy. Egress queue drop counters are vital for identifying congestion and traffic drops on outbound ports. Analyzing these egress queue drops helps customers troubleshoot network congestion and resolve performance issues. Furthermore, queue drop counters can be activated at the device level, allowing for an overall assessment of queue drops across all queues on every interface of a device.

Figure 2: Rule for Queue drops
Figure 3: Filter options for interface Queue Drop counters rule
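To illustrate the device-level view of queue drops, here is a small Python sketch that sums egress drops across all queues on every interface of each device and flags devices above a threshold. The tuple layout, function names, and threshold are assumptions made for illustration and do not reflect how ONES stores or evaluates these counters.

```python
from collections import defaultdict


def device_drop_totals(samples):
    """Sum queue drops per device.

    `samples` is an iterable of (device, interface, queue, drops) tuples,
    where `drops` is the number of packets dropped in the polling interval.
    """
    totals = defaultdict(int)
    for device, _interface, _queue, drops in samples:
        totals[device] += drops
    return dict(totals)


def devices_over_threshold(samples, threshold):
    """Return the sorted names of devices whose total drops exceed `threshold`."""
    totals = device_drop_totals(samples)
    return sorted(device for device, total in totals.items() if total > threshold)
```

An interface-level rule would group by (device, interface) instead, which narrows the alert to the specific outbound port where the drops occur.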

Failure Detection: Link Flap

Link failure is another critical metric that needs to be monitored in an AI-Fabric. Bad links caused by improper cabling and transceivers can significantly affect the lossless requirement for RoCE traffic. It is critical to alert on them and take corrective action to avoid traffic loss and performance degradation. Corrective actions could include replacing bad optics or adjusting control plane policies to re-route traffic toward a better path. The ONES Rule Engine continuously monitors links over a specific interval and automatically creates alerts with the necessary payload, including the device, optics information, device location, layer, and other details. This gives DCOs the data they need to take corrective action.
Figure 4: Alert payload – link down
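Monitoring a link over a specific interval can be sketched as counting up-to-down transitions inside a time window. The Python below is only an illustration under assumed inputs: the event format (timestamp, state), the function names, and the `max_downs` limit are hypothetical and are not taken from ONES.

```python
def count_link_downs(events, window_start, window_end):
    """Count up-to-down transitions that fall inside the window.

    `events` is a list of (timestamp, state) pairs for one interface,
    where state is "up" or "down". Timestamps are plain numbers (seconds).
    """
    downs = 0
    previous_state = None
    for timestamp, state in sorted(events):
        went_down = state == "down" and previous_state == "up"
        if went_down and window_start <= timestamp <= window_end:
            downs += 1
        previous_state = state
    return downs


def is_flapping(events, window_start, window_end, max_downs):
    """Treat the link as flapping if it went down more than `max_downs` times."""
    return count_link_downs(events, window_start, window_end) > max_downs
```

For example, a link that goes down at seconds 10 and 20 of a 60-second window would be flagged with `max_downs=1`, at which point an alert with the device, optics, and location details could be raised.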