The ONES Rule Engine is a sophisticated feature that enhances your network management capabilities by incorporating an integrated alert and notification system. It delivers detailed monitoring metrics and facilitates easy creation of rules at both device and interface levels. The latest update to the ONES Rule Engine has broadened its capabilities to monitor AI-Fabric metrics such as queue counters, PFC, traffic rates, and link and node failures. This enhancement allows administrators to achieve better visibility into network performance, pinpoint potential issues, and proactively maintain optimal conditions for RoCE-based applications and workloads.
Anomaly Detection and Alerting on AI-Fabric
Queue Counters
- Packet Transmit Rate: This counter tracks the rate at which packets are transmitted from the RoCE queue. A high transmit rate could indicate potential congestion or oversubscription.
- Packet Receive Rate: This counter measures the rate at which packets are received on the RoCE queue. Monitoring this rate can help identify potential bottlenecks or overloaded receivers.
- Dropped Packets: This counter tracks the number of packets dropped due to queue overflows or other reasons. Excessive packet drops can severely impact RoCE performance and should be investigated.
- ECN Marked Packets: Explicit Congestion Notification (ECN) is a mechanism used by RoCE to signal congestion. Monitoring ECN marked packets can help identify and mitigate congestion before it becomes severe.
Performance Counters
- PFC (Priority Flow Control): PFC is a mechanism used by RoCE to pause traffic temporarily to prevent packet loss. Monitoring PFC events can help identify potential congestion hotspots in the network.
ONES 2.1 Rule Engine for Anomalies & Alerting
PFC Receive and Transmit Counters
Priority Flow Control (PFC) is a mechanism that prevents frame loss due to congestion. It operates by sending priority pause frames (per traffic class) to the sender when buffer thresholds are exceeded. The count of priority pause frames sent and received by the device is available in the PFC counters. By actively monitoring PFC counters, the ONES Rule Engine can alert data center operators and administrators to potential congestion and hotspots. Customers can set their desired congestion threshold for alerting using the following attributes:
- Time Interval: 5 min, 10 min, 15 min, 30 min, 1 hr
- Conditions: Threshold breach (Greater/Lesser/Equal)
- Count Occurrences: 1 to as high as millions
Figure 1: Rule Configuration - Interface PFC receive counters
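As a rough illustration of how the three rule attributes could combine, here is a minimal Python sketch. It assumes the rule fires when the growth of a cumulative PFC counter within the time interval satisfies the chosen condition against the configured count; the actual ONES evaluation semantics may differ. `PfcRule`, `rule_breached`, and the sample format are hypothetical names, not part of the ONES API.

```python
from dataclasses import dataclass
import operator

# Hypothetical mapping of the rule's condition names to comparisons.
CONDITIONS = {
    "greater": operator.gt,
    "lesser": operator.lt,
    "equal": operator.eq,
}


@dataclass
class PfcRule:
    interval_s: int  # time interval in seconds, e.g. 300 for 5 min
    condition: str   # "greater", "lesser", or "equal"
    count: int       # threshold on pause frames seen within the interval


def rule_breached(rule, samples, now):
    """Return True if the counter growth inside the window meets the rule.

    samples: list of (timestamp_s, cumulative_pfc_counter), oldest first.
    Assumption: the counter is cumulative and does not wrap in the window.
    """
    window = [(t, c) for t, c in samples if now - rule.interval_s <= t <= now]
    if len(window) < 2:
        return False  # not enough data points to compute a delta
    delta = window[-1][1] - window[0][1]
    return CONDITIONS[rule.condition](delta, rule.count)


rule = PfcRule(interval_s=300, condition="greater", count=100)
samples = [(0, 1000), (60, 1020), (240, 1200), (300, 1250)]
print(rule_breached(rule, samples, now=300))
```

In this sketch the counter grows by 250 pause frames over the 5-minute window, so a "greater than 100" rule is treated as breached.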
When the rule conditions are met, the ONES Rule Engine dispatches alerts through configured channels such as Slack and Zendesk, and also displays them on the Watcher – Alerts page. Each alert includes essential details about the triggering event, including the device, interface, and queue.
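To make the alert contents concrete, the following Python sketch builds a message carrying the device, interface, and queue details mentioned above. The function name, field names, and message layout are assumptions for illustration and do not reflect the actual ONES alert format. The only external convention relied on is that a Slack incoming webhook accepts a JSON body with a `text` field; the sketch builds that body without sending it.

```python
import json


def format_alert(device, interface, queue, counter, value, threshold):
    """Build a Slack-style payload for a breached rule (hypothetical layout)."""
    text = (
        f"[ALERT] {counter} on {device} {interface} queue {queue}: "
        f"value {value} breached threshold {threshold}"
    )
    # Slack incoming webhooks accept a JSON object with a "text" field.
    return {"text": text}


payload = format_alert("leaf-1", "Ethernet8", 3, "pfc_rx", 250, 100)
print(json.dumps(payload))
```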
Queue Drop Counters
When setting up a network for lossless applications such as RoCE, it’s crucial to also monitor flows that may become lossy. Egress queue drop counters are vital for identifying congestion and traffic drops on outbound ports. Analyzing these egress queue drops helps customers troubleshoot network congestion and resolve performance issues. Furthermore, queue drop counters can be activated at the device level, allowing for an overall assessment of queue drops across all queues on every interface of a device.
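The device-level view can be thought of as a roll-up of per-queue drop counts across every interface. The Python sketch below shows one way such an aggregation might look; the nested-dictionary input shape and the `device_drop_summary` name are assumptions for illustration, not how ONES stores or computes these counters.

```python
def device_drop_summary(drops):
    """Sum egress queue drops per interface and for the whole device.

    drops: {interface_name: {queue_id: dropped_packets}} (hypothetical shape).
    """
    per_interface = {
        ifname: sum(queues.values()) for ifname, queues in drops.items()
    }
    return {
        "device_total": sum(per_interface.values()),
        "per_interface": per_interface,
    }


drops = {
    "Ethernet0": {3: 10, 4: 0},
    "Ethernet4": {3: 5, 4: 7},
}
summary = device_drop_summary(drops)
print(summary["device_total"], summary["per_interface"])
```

A device-level rule could then compare `device_total` against a threshold, while the per-interface totals help narrow down where the drops occur.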