

Join Us at NVIDIA GTC: A Must-Watch Panel on AI in Networking!

AI is transforming every industry—including networking. As AI workloads scale, the infrastructure powering them must evolve. Networks must become smarter, faster, and more efficient to support the next wave of AI-driven innovation. At Aviz Networks, we believe in Networks for AI and AI for Networks—and we’re making it happen.

That’s why I’m thrilled to invite you to an exclusive panel at NVIDIA GTC, where we’ll explore NVIDIA’s role in transforming networking across these two dimensions and how Aviz Networks complements this ecosystem with innovative products.

Panel: Network Modernization in the Age of AI

AI-driven workloads demand a new era of networking—one that redefines how we design, deploy, and optimize infrastructure for peak performance. Our discussion will cover:

Why This Matters

AI is no longer just an application—it’s the backbone of modern enterprise infrastructure. But here’s the challenge: AI workloads are hungry for bandwidth, require extreme precision, and demand real-time optimization. Aviz Networks and NVIDIA are solving this with cutting-edge AI networking innovations.

Don’t just read about AI in networking—experience it. Watch our exclusive demo and join the panel discussion at NVIDIA GTC!


Transforming AI Fabric with ONES: Enhanced Observability for GPU Performance

Explore the latest in AI network management with our ONES 3.0 series

Future of Intelligent Networking for AI Fabric Optimization

If you’re operating a high-performance data center or managing AI/ML workloads, ONES 3.0 offers advanced features that ensure your network remains optimized and congestion-free, with lossless data transmission as a core priority.

In today’s fast-paced, AI-driven world, network infrastructure must evolve to meet the growing demands of high-performance computing, real-time data processing, and seamless communication. As organizations build increasingly complex AI models, the need for low-latency, lossless data transmission, and sophisticated scheduling of network traffic has become crucial. ONES 3.0 is designed to address these requirements by offering cutting-edge tools for managing AI fabrics with precision and scalability.

Building on the solid foundation laid by ONES 2.0, where RoCE (RDMA over Converged Ethernet) support enabled lossless communication and enhanced proactive congestion management, ONES 3.0 takes these capabilities to the next level. We’ve further improved RoCE features with the introduction of PFC Watchdog (PFCWD) for enhanced fault tolerance, Scheduler for optimized traffic handling, and WRED for intelligent queue management, ensuring that AI workloads remain highly efficient and resilient, even in the most demanding environments.

Why RoCE is Critical for Building AI Models

As the next generation of AI models requires vast amounts of data to be transferred quickly and reliably across nodes, RoCE becomes an indispensable technology. By enabling remote direct memory access (RDMA) over Ethernet, RoCE facilitates low-latency, high-throughput, and lossless data transmission—all critical elements in building and training modern AI models.

In AI workloads, scheduling data packets effectively ensures that model training is not delayed due to network congestion or packet loss. RoCE’s ability to prioritize traffic and ensure lossless data movement allows AI models to operate at optimal speeds, making it a perfect fit for today’s AI infrastructures. Whether it’s transferring large datasets between GPU clusters or ensuring smooth communication between nodes in a distributed AI system, RoCE ensures that critical data flows seamlessly without compromising performance.

Enhancing RoCE Capabilities from ONES 2.0 to ONES 3.0

In ONES 3.0, we’ve taken RoCE management even further, enhancing the ability to monitor and optimize Priority Flow Control (PFC) and ensuring lossless RDMA traffic under heavy network loads. The new PFC Watchdog (PFCWD) ensures that any misconfiguration or failure in flow control is detected and addressed in real time, preventing traffic stalls or congestion collapse in AI-driven environments.

Additionally, ONES 3.0’s Scheduler allows for more sophisticated data packet scheduling, ensuring that AI tasks are executed with precision and efficiency. Combined with WRED (Weighted Random Early Detection), which intelligently manages queue drops to prevent buffer overflow in congested networks, ONES 3.0 provides a holistic solution for RoCE-enabled AI fabrics.
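The WRED behaviour described here can be pictured as a drop-probability ramp between two queue thresholds. The sketch below is an illustrative model, not ONES or SONiC source code, and the threshold and probability values are hypothetical:

```python
def wred_drop_probability(avg_queue_len: float, min_th: float, max_th: float, max_p: float) -> float:
    """WRED's linear drop-probability ramp between its two thresholds."""
    if avg_queue_len <= min_th:
        return 0.0                      # below min threshold: never drop
    if avg_queue_len >= max_th:
        return 1.0                      # above max threshold: tail-drop region
    # in between: probability rises linearly from 0 up to max_p
    return max_p * (avg_queue_len - min_th) / (max_th - min_th)

# Example: with thresholds of 1000/2000 cells and max_p of 10%, a queue
# averaging 1500 cells drops roughly 5% of arriving packets.
p = wred_drop_probability(1500, min_th=1000, max_th=2000, max_p=0.1)
```

Dropping (or ECN-marking) a small fraction of packets early is what keeps the queue from overflowing and triggering far more disruptive bulk tail drops.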

The Importance of QoS and RoCE in AI Networks

Quality of Service (QoS) and RoCE are pivotal in ensuring that AI networks can handle the rigorous demands of real-time processing and massive data exchanges without performance degradation. In environments where AI workloads must process large amounts of data between nodes, QoS ensures that critical tasks receive the required bandwidth, while RoCE ensures that this data is transmitted with minimal latency and no packet loss.

With AI workloads demanding real-time responsiveness, any network inefficiency or congestion can slow down AI model training, leading to delays and sub-optimal performance. The advanced QoS mechanisms in ONES 3.0, combined with enhanced RoCE features, provide the necessary tools to prioritize traffic, monitor congestion, and optimize the network for the low-latency, high-reliability communication that AI models depend on.

In ONES 3.0, QoS features such as DSCP mapping, WRED, and scheduling profiles allow customers to:

By leveraging QoS in combination with RoCE, ONES 3.0 creates an optimized environment for AI networks, allowing customers to confidently build and train next-generation AI models without worrying about data bottlenecks.
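To make the mapping chain these QoS features manage concrete, the sketch below resolves a packet's DSCP marking to a traffic class and then to an egress queue. The specific map values are invented examples, not a shipped ONES profile:

```python
# Hypothetical example maps; real deployments define these per fabric.
DSCP_TO_TC = {46: 5, 26: 3, 0: 0}   # e.g. EF -> TC 5, AF31 -> TC 3, best effort -> TC 0
TC_TO_QUEUE = {5: 5, 3: 3, 0: 0}    # each traffic class lands in an egress queue

def queue_for_dscp(dscp: int) -> int:
    """Follow the DSCP -> traffic class -> queue chain for one packet."""
    tc = DSCP_TO_TC.get(dscp, 0)    # unmapped DSCP values fall back to TC 0
    return TC_TO_QUEUE[tc]
```

Once traffic lands in a known queue, PFC, WRED, and the scheduler can all be applied per queue, which is what makes the classification step the foundation of the rest of the QoS pipeline.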

1. Comprehensive Interface and Performance Metrics

The UI showcases essential network performance indicators such as In/Out packet rates, errors, and discards, all displayed in real time. These metrics give customers the ability to:
By having access to real-time and historical data, customers can make data-driven decisions to optimize network performance without sacrificing the quality of their AI workloads.

2. RoCE Config Visualization

RoCE (RDMA over Converged Ethernet) is a key technology used to achieve high-throughput and low-latency communication, especially when training AI models, where data packets must flow without loss. In ONES 3.0, the RoCE tab within the UI offers full transparency into how data traffic is managed:

3. Visual Traffic Monitoring: A Data-Driven Experience

The UI doesn’t just give you raw data—it helps you visualize it. With multiple graphing options and real-time statistics, customers can easily monitor:

4. Flexible Time-Based Monitoring and Analysis

Customers have the option to track metrics over various time periods, from live updates (1 hour) to historical views (12 hours, 2 weeks, etc.). This flexibility allows customers to:
This feature is especially valuable for customers running AI workloads, where consistent performance over extended periods is vital for the accuracy and efficiency of model training.

Centralized QoS View

ONES 3.0 offers a unified interface for all QoS configurations, including DSCP to TC mappings, WRED/ECN, and scheduler profiles, making traffic management simpler for network admins.
This page provides administrators with comprehensive insights into how traffic flows through the network, allowing them to fine-tune and optimize their configurations to meet the unique demands of modern workloads.
Fig 1 – QoS Profile List

Comprehensive Topology View

ONES offers a comprehensive, interactive map of network devices and their connectivity, ideal for monitoring AI/ML and RoCE environments. It provides an actionable overview that simplifies network management.
Fig 2 – AI-ML Topology View
Key features include:

Overall, the Topology Page in ONES enhances network observability and control, making it easier to optimize performance, troubleshoot issues, and ensure the smooth operation of AI/ML and RoCE workloads.

Proactive Monitoring and Alerts with the Enhanced ONES Rule Engine

The ONES Rule Engine has been a standout feature in previous releases, providing robust monitoring and alerting capabilities for network administrators. With the latest update, we’ve enhanced the usability and functionality, making rule creation and alert configuration even smoother and more intuitive. Whether monitoring RoCE metrics or AI-Fabric performance counters, administrators can now set up alerts with greater precision and ease. This new streamlined experience allows for better anomaly detection, helping prevent network congestion and data loss before they impact performance.

The ONES Rule Engine offers cutting-edge capabilities for proactive network management, enabling real-time anomaly detection and alerting. It provides deep visibility into AI-Fabric metrics like queue counters, PFC events, packet rates, and link failures, ensuring smooth performance for RoCE-based applications. By allowing users to set custom thresholds and conditions for congestion detection, the Rule Engine ensures that network administrators can swiftly address potential bottlenecks before they escalate.

With integrated alerting systems such as Slack and Zendesk, administrators can respond instantly to network anomalies. The ONES Rule Engine’s automation streamlines monitoring and troubleshooting, helping prevent data loss and maintain optimal network conditions, ultimately enhancing the overall network efficiency.
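The threshold-and-alert flow can be sketched in a few lines. The rule shape, metric names, and thresholds below are assumptions for illustration; the real Rule Engine is configured through the ONES UI:

```python
def evaluate_rules(metrics: dict, rules: list) -> list:
    """Return an alert message for every rule whose threshold is breached."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is not None and value > rule["threshold"]:
            alerts.append(f"{rule['name']}: {rule['metric']}={value} exceeds {rule['threshold']}")
    return alerts

# Hypothetical rules and one polling cycle's metrics snapshot.
rules = [
    {"name": "PFC storm", "metric": "pfc_pause_frames_per_sec", "threshold": 10_000},
    {"name": "Queue drops", "metric": "queue3_dropped_packets", "threshold": 0},
]
metrics = {"pfc_pause_frames_per_sec": 25_000, "queue3_dropped_packets": 0}
alerts = evaluate_rules(metrics, rules)   # only the PFC storm rule fires here
```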

Conclusion

In an era where AI and machine learning are driving transformative innovations, the need for a robust and efficient network infrastructure has never been more critical. ONES 3.0 ensures that AI workloads can operate seamlessly, with minimal latency and no packet loss.

FAQs

1. Why is RoCE critical for AI infrastructure and model training?

RoCE (RDMA over Converged Ethernet) is essential for AI because it enables:

  • Low-latency, high-throughput data transfers between GPU nodes
  • Lossless communication, vital for real-time model training
  • Efficient memory access without CPU involvement
    This makes RoCE a foundational technology for building and scaling AI/ML workloads.

 ONES 3.0 advances RoCE integration through:

  • PFC Watchdog (PFCWD) for monitoring and recovering from flow control issues
  • Advanced scheduling tools (DWRR, WRR, STRICT) to manage packet priorities
  • WRED-based queue management to prevent buffer overflows

These features ensure network reliability, even under high AI traffic loads.

 Quality of Service (QoS) is crucial for prioritizing AI tasks. ONES 3.0 includes:

  • DSCP and dot1p mapping for accurate traffic classification
  • Priority queue configuration to handle mission-critical packets
  • Real-time congestion alerts and traffic shaping for lossless AI data transmission

Together, these ensure uninterrupted, high-performance AI workloads.

 The ONES UI provides deep visibility into:

  • DSCP and 802.1p mapping to queues and priority groups
  • WRED and PFC stats for congestion handling
  • Scheduler profiles and queue usage across switches

This empowers network admins to proactively tune RoCE traffic and avoid disruptions.

The enhanced ONES Rule Engine enables proactive, automated management through:

  • Custom alert rules for RoCE, queue drops, and link failures
  • Slack/Zendesk integration for instant anomaly notifications
  • Granular threshold settings to prevent issues before they affect AI training
    It turns ONES into an intelligent observability and incident response system.


Global Reach, Local Insight: ONES 3.0 Delivers Seamless Data Center Management


ONES 3.0 introduces a range of exciting new features, with a focus on scaling data center deployments and support. In this blog post, we’ll dive into two standout features: ONES Multisite, a scalable solution for global data center deployments, and enhanced support for SONiC through tech support, ServiceNow integration, and syslog message filtering. Let’s explore how these innovations can benefit your operations.

ONES Multisite

The ONES rule engine enables incident detection and alert generation, but this data is limited to the specific site managed by each controller. While site data center administrators can use this information to address and resolve issues, enterprise-level administrators or executives seeking an overview of all data centers’ health must access each ONES instance individually, which can be inefficient.

To address this challenge, we introduce ONES Multisite—an application that provides a geospatial overview of anomalies across geographically distributed sites, offering a comprehensive view of the entire network’s health.

ONES instances in different data centers (DCs) around the globe can register with a central multisite application. Upon successful registration, the multisite system periodically polls each site for data related to the number of managed devices (endpoints) and the number of critical alerts. This information is displayed on a map view, showing individual sites, their health status, and last contact times. ONES Multisite also allows users to log in to individual data centers for more detailed information if needed.
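The polling-and-status model reads roughly like the sketch below. The data shape and the red/green health rule are assumptions for illustration; the transport (`fetch`) stands in for the HTTP call the multisite application makes to each ONES instance:

```python
from dataclasses import dataclass

@dataclass
class SiteStatus:
    """One registered ONES site as seen by the multisite application."""
    name: str
    lat: float
    lon: float
    endpoints: int          # number of managed devices the site reported
    critical_alerts: int    # count of critical alerts at the last poll

    @property
    def health(self) -> str:
        # hypothetical colour rule: any critical alert turns the site red
        return "red" if self.critical_alerts > 0 else "green"

def poll_sites(fetch, sites):
    """Poll every registered site; `fetch(site)` is the injected API call."""
    return [fetch(s) for s in sites]
```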

Fig 1 – ONES Multisite showing DCs across the globe
To provide a quick overview of the health conditions at various sites, different colors and blinking patterns are used.

Registering ONES instance with Multisite application

A simple user interface is provided for registering the ONES application with the multisite, requiring inputs such as the site name, multisite IP, and geographical coordinates (latitude and longitude, in N and E). By default, the current location coordinates of the site are auto-populated, but they can be overridden if necessary. The License page of the ONES application displays the registration status with the multisite application.
Fig 2 – Multisite Registration Window

Once registered, the multisite application will regularly gather data from each site regarding the number of managed devices (endpoints) and the count of critical alerts.

ONES Multisite streamlines the monitoring process across multiple data centers, enabling enterprise-level administrators to easily access vital information and maintain a holistic view of their network’s health. This enhanced visibility not only improves operational efficiency but also empowers teams to respond more effectively to incidents, ensuring optimal performance across all locations.

Enhanced support for SONiC using ONES 3.0

Tech support feature

The SONiC Tech Support feature provides a comprehensive method for collecting system information, logs, configuration data, core dumps, and other relevant information essential for identifying and resolving issues. The ONES 3.0 Tech Support feature offers an easier way to download the tech support dump from any managed switch. Users simply select a switch and click the Tech Support option. The ONES controller connects to the switch, executes the tech support command, and notifies the user when the download file is ready. This allows data center administrators to easily retrieve tech support data without the cumbersome process of logging into each switch, executing the command, and downloading the file.
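Under the hood this amounts to running SONiC's `show techsupport` command and fetching the archive it produces. The sketch below assumes an injected `run_remote` transport (in practice ONES manages the connection for you) and simply locates the dump path in the command output:

```python
def collect_techsupport(run_remote) -> str:
    """Run `show techsupport` on a switch and return the dump archive path."""
    output = run_remote("show techsupport")
    # SONiC prints the path of the generated .tar.gz dump; take the last
    # line of output that looks like one.
    for line in reversed(output.splitlines()):
        if line.strip().endswith(".tar.gz"):
            return line.strip()
    raise RuntimeError("tech support dump path not found in command output")
```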

Fig 3 – ONES Tech Support page

Filtering of syslog messages

The Syslog feature empowers data center operators to easily view and download syslog messages from any of the managed switches through the ONES UI. This functionality is essential for monitoring system performance and diagnosing issues.

To enhance this feature, we’ve introduced the ability to filter messages by severity level, such as error or warning, or to view all messages. This capability enables operators to quickly identify and prioritize critical alerts, streamlining the troubleshooting process and improving overall operational efficiency. By focusing on the most relevant messages, data center teams can respond more effectively to potential issues, ensuring a more reliable and robust network environment.
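Conceptually, the filter keeps messages at or above a chosen severity. The sketch below assumes a simplified `<timestamp> <severity> <message>` line layout, which is an illustration rather than the exact SONiC syslog format:

```python
# Standard syslog severities, most severe first.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

def filter_syslog(lines, max_level="warning"):
    """Keep lines whose severity is at least as severe as max_level."""
    cutoff = SEVERITIES.index(max_level)
    kept = []
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[1] in SEVERITIES and SEVERITIES.index(parts[1]) <= cutoff:
            kept.append(line)
    return kept

# Hypothetical sample messages.
logs = [
    "Oct01T10:00:00 err bgp: neighbor 10.0.0.1 down",
    "Oct01T10:00:01 info syncd: port counters refreshed",
    "Oct01T10:00:02 warning orchagent: queue 3 near threshold",
]
```

With `max_level="warning"`, the `info` line is filtered out and only the error and warning messages remain.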

Fig 4 – Syslog messages with filter applied

ServiceNow Integration

ServiceNow is a cloud-based platform widely used for IT Service Management, automating business processes, and Enterprise Service Management. One of its core components is the ServiceNow ticketing system, specifically the Incident Management feature. When a user encounters a disruption in any IT service, it is reported as an incident on the platform and assigned to the responsible user or group for resolution.

The ONES Rule Engine proactively monitors the data center for potential disruptive events by creating alerts for any breaches of user-configured thresholds. It tracks various factors, such as sudden surges in CPU usage, heavy traffic bursts, and component failures (e.g., PSU, FAN).

ONES 3.0 enhances this functionality by integrating ServiceNow ticketing with the ONES Rule Engine and Alerts Engine. This integration allows ONES to automatically log tickets in the ServiceNow platform whenever any ONES rule conditions are met.
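ServiceNow's Table API exposes incident creation as a POST to `/api/now/table/incident`. The sketch below shows roughly how an alert could be turned into such a request; the instance name, field choices, and alert shape are assumptions for illustration, and ONES performs the equivalent call automatically when a rule fires:

```python
def build_incident_request(instance: str, alert: dict):
    """Build the (url, payload) for a ServiceNow incident-creation POST."""
    url = f"https://{instance}.service-now.com/api/now/table/incident"
    payload = {
        "short_description": f"[ONES] {alert['rule']} on {alert['device']}",
        "description": alert.get("detail", ""),
        "urgency": "1" if alert.get("severity") == "critical" else "2",
    }
    return url, payload

def report_alert(post, instance: str, alert: dict):
    """`post` is an injected HTTP client call (e.g. requests.post with auth)."""
    url, payload = build_incident_request(instance, alert)
    return post(url, json=payload)
```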

Fig 5 – Rule creation page with ServiceNow integrated
Fig 6 – ServiceNow platform with ONES tickets
In summary, ONES 3.0 brings significant advancements that cater to the evolving needs of data center management.

To unlock the full potential of ONES 3.0 and see how it can revolutionize your network operations, book your demo today.

FAQs

1. What is ONES Multisite and how does it improve global data center monitoring?

ONES Multisite provides a centralized geospatial view of data center health across global sites, allowing enterprise administrators to monitor critical alerts and device statuses from a single interface, drastically improving visibility and incident response times.

2. How does ONES 3.0 integrate with ServiceNow?

ONES 3.0 connects its built-in Rule and Alerts Engine with ServiceNow to automatically generate tickets for anomalies like CPU surges, component failures, or bandwidth spikes—ensuring streamlined IT service workflows and faster resolution times.

3. Does ONES 3.0 simplify tech support collection on SONiC switches?

Yes, ONES 3.0 introduces a simplified “Tech Support” feature that lets users download diagnostic logs from any managed SONiC switch with one click, eliminating the need for manual CLI access across devices.

4. How does syslog filtering in ONES 3.0 speed up troubleshooting?

With advanced severity-level filtering (e.g., error, warning, info), ONES 3.0 helps operators quickly pinpoint critical syslog alerts from SONiC switches—accelerating root cause analysis and operational troubleshooting.

5. What makes ONES 3.0 well suited to managing modern data centers?

ONES 3.0 delivers single-pane visibility, ServiceNow integration, multisite scalability, and simplified support tools—making it the ideal centralized platform for managing complex, AI-powered, multi-vendor data center environments.


AI Fabric Orchestration: Supercharging AI Networks with SONiC NOS


As the demand for high-performance parallel processing surges in the AI era, GPU clusters have become the heart of data-intensive workloads. But it’s not just about the GPUs themselves—intercommunication between GPU servers is the backbone of their overall performance. Enter the network switch fabric, which is pivotal in overcoming communication bottlenecks and ensuring seamless data flow between GPU servers. Technologies like RoCE (RDMA over Converged Ethernet) allow massive chunks of data to move efficiently between servers, but ensuring that these critical data streams remain lossless and uncongested requires a powerful solution.

That’s where SONiC’s QoS (Quality of Service) features come into play. SONiC enables you to prioritize critical data traffic, ensuring that high-priority packets are transferred ahead of other traffic and that important data is not lost. Using SONiC’s robust QoS capabilities and ONES 3.0’s orchestration, you can turn your switch fabric into a lossless, priority-driven highway for GPU server communications.

Let’s explore how you can achieve this with SONiC via the ONES 3.0 Fabric Manager orchestration tool.

Lossless And Prioritized Data Flow

Any packet entering the fabric with a DSCP/dot1p marking can be mapped to any queue on the interface, and enabling PFC on that queue makes it lossless. With PFC in place, when congestion is detected in the queue, a pause frame is sent back to the sender, signaling it to temporarily halt traffic of that priority. This mechanism effectively prevents packet drops, ensuring lossless transmission for traffic of a particular priority.
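A toy model of this pause behaviour looks like the sketch below. The XOFF/XON thresholds are hypothetical numbers; in real switches they are buffer watermarks, and the pause itself is a priority-specific Ethernet frame:

```python
class LosslessQueue:
    """Minimal model of one PFC-enabled priority queue."""

    def __init__(self, xoff: int = 100, xon: int = 60):
        self.depth = 0
        self.xoff, self.xon = xoff, xon
        self.paused = False          # True while a pause is asserted upstream

    def enqueue(self, pkts: int) -> bool:
        self.depth += pkts
        if self.depth >= self.xoff:
            self.paused = True       # signal the sender to stop: no packets dropped
        return self.paused

    def dequeue(self, pkts: int) -> bool:
        self.depth = max(0, self.depth - pkts)
        if self.paused and self.depth <= self.xon:
            self.paused = False      # queue has drained: let traffic resume
        return self.paused
```

The XON threshold sits below XOFF so the pause is not released until the queue has genuinely drained, which avoids rapid pause/resume flapping.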

Beyond PFC, there’s another layer of congestion management: Explicit Congestion Notification (ECN). With ECN, we can define buffer thresholds beyond which congestion notification (ECN-CNP) packets are sent to the sender, prompting it to reduce its transmission rate and proactively avoid congestion.

At this stage, we’ve ensured that our priority traffic is lossless. Moving into the egress phase, we can further enhance performance by prioritizing this traffic over others, even under congestion. SONiC provides scheduling algorithms like Deficit Weighted Round Robin (DWRR), Weighted Round Robin (WRR), and Strict Priority Scheduling (STRICT). By binding priority queues to these schedulers, the system can ensure that higher-priority traffic is transmitted preferentially, either in a weighted manner (for WRR/DWRR) or with absolute priority (for STRICT).

In summary, through PFC, ECN, and advanced scheduling techniques, SONiC ensures that high-priority traffic from GPU servers is not only lossless but also prioritized during both congestion and egress phases.
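As a rough illustration of how DWRR arbitrates among queues, the sketch below gives each queue per-round credit proportional to its weight and transmits while credit lasts. The packet sizes, weights, and quantum are made-up example values:

```python
from collections import deque

def dwrr_service_order(queues, weights, quantum=100):
    """queues: deques of packet sizes; returns the order queues are served in."""
    deficits = [0] * len(queues)
    order = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                continue
            deficits[i] += quantum * weights[i] // 100   # credit for this round
            while q and q[0] <= deficits[i]:             # transmit while credit lasts
                deficits[i] -= q.popleft()
                order.append(i)
    return order

# Queue 0 (weight 60) is serviced more often than queue 1 (weight 40).
order = dwrr_service_order([deque([100, 100]), deque([100])], [60, 40])
```

With STRICT scheduling there is no weighting at all: the highest-priority queue is always drained first, which is why it suits small, critical flows such as congestion notification packets.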

Simplifying Complex QoS Configurations with ONES Orchestration

Configuring SONiC’s complex QoS features may sound daunting, but with ONES 3.0’s seamless orchestration, it’s a breeze. ONES allows you to set up essential QoS configurations like DSCP to traffic-class mapping, PFC, ECN thresholds, and even scheduler types—all with a few lines in a YAML template. Here’s a snapshot of the YAML template showcasing how ONES orchestrates SONiC QoS (QoS is the section in YAML below)

Fig 1 – ONES UI AI Fabric Orchestration YAML Template

The Fabric Manager automates the creation and assignment of QoS profiles, saving administrators from manually configuring multiple aspects. Here’s how it works:

Mapping Traffic Classes and Queues

Orchestration begins by mapping traffic into appropriate classes and queues. ONES 3.0 Orchestration allows you to specify mapping values from DSCP (Layer 3) and dot1p (Layer 2) to traffic classes, traffic classes to queues, and traffic classes to priority groups (PGs). From these mapping values, profiles are created with standard names (DOT1P_TC_PROFILE, TC_QUEUE_PROFILE, TC_PG_PROFILE, DSCP_TC_PROFILE) and are bound to the interfaces that are part of the orchestration. This configuration ensures that each type of traffic is routed to its appropriate queue and handled correctly.

For example, we can specify mapping values in the YAML as in the image above, and the Fabric Manager will create the corresponding profiles and bind them to the interface as below:

Priority Flow Control (PFC) and Explicit Congestion Notification (ECN)

The next critical part of QoS orchestration involves Priority Flow Control (PFC), where the ONES YAML allows users to define the queues that should be PFC-enabled. Moreover, a PFC Watchdog can be configured, with detection and restoration times and the action to take in case of malfunction, to ensure that PFC keeps functioning correctly.

ECN configuration parameters can be provided in the YAML template; from these, the ONES Fabric Manager creates a WRED_PROFILE and attaches it to all PFC-enabled queues on every interface that is part of the orchestration.

Here’s an example of how this would be configured on the interface for the YAML input in the above image.

This approach ensures that your network proactively manages congestion and minimizes packet drops for high-priority traffic.

Advanced Scheduling for Optimized Egress

Finally, Scheduling plays a vital role in controlling how packets are forwarded from queues. Orchestration allows administrators to choose between scheduling mechanisms such as Deficit Weighted Round Robin (DWRR), Weighted Round Robin (WRR), or STRICT priority scheduling, depending on their needs.

In the case of DWRR or WRR, weights can be assigned to each queue, influencing how often a queue is serviced relative to others. Upon specifying these parameters in the YAML, ONES-FM creates one scheduler policy (SCHEDULER.<weight>) per unique weight assigned to the queues and attaches these policies to the queues according to their weights, for all interfaces that are part of the orchestration.

For instance, in the YAML input shown in the image below, two unique weights, 60 and 40, are assigned to queues 3 and 4 respectively. So, two scheduler policies, SCHEDULER.60 and SCHEDULER.40, are created and bound to interface queues 3 and 4 respectively.
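That naming behaviour can be sketched as a small helper. The input shape (a `{queue: weight}` mapping mirroring the YAML) is an assumption for illustration:

```python
def build_scheduler_bindings(queue_weights: dict) -> dict:
    """Create one SCHEDULER.<weight> policy per unique weight and bind it
    to every queue carrying that weight."""
    policies = {w: f"SCHEDULER.{w}" for w in set(queue_weights.values())}
    return {queue: policies[w] for queue, w in queue_weights.items()}
```

Note that deriving policies from the set of unique weights means two queues sharing a weight would share one scheduler policy rather than getting duplicates.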

Now a question arises: what if all the queues are congested? How do the congestion notification packets traverse the network to reach the sender and tell it to stop or slow down the incoming traffic?

ONES-FM provides an option to designate a specific queue for ECN-CNP (Explicit Congestion Notification) packets, using STRICT scheduling. This ensures that even when the network is heavily congested there is always room left for the congestion notification packets, preventing further blockages. The cnp_queue field under the ECN section in the image above represents this, and it is orchestrated by ONES-FM as below:

Flexible, Day-2 Support for QoS Management

One of the standout features of ONES-FM 3.0 is its support for Day-2 operations. As your network evolves and traffic patterns change, you can modify the QoS configurations through either the YAML template or the NetOps API. This flexibility ensures your network is always tuned to deliver the performance required by your AI workloads.

Future-Proof Your AI Infrastructure with ONES 3.0

With its intuitive YAML-based approach and support for dynamic Day-2 adjustments, ONES Fabric Manager eliminates much of the complexity associated with configuring and managing networks, giving you confidence that your network infrastructure is both reliable and future-proof. In essence, ONES Fabric Manager enables seamless orchestration for AI fabrics, ensuring your network is always ready to meet the growing demands of AI-driven data centers.

FAQs

1. How does SONiC NOS enable lossless data transfer for GPU-based AI workloads?

SONiC NOS supports Priority Flow Control (PFC) and Explicit Congestion Notification (ECN), ensuring lossless, high-priority traffic flows—critical for real-time communication between GPU clusters in AI data centers.

2. How does ONES 3.0 simplify SONiC QoS configuration?

ONES 3.0 provides YAML-based orchestration to simplify complex SONiC QoS settings like DSCP mapping, PFC, ECN, and queue scheduling, reducing configuration time and errors.

3. Can ONES 3.0 manage QoS across multi-vendor hardware?

Yes, ONES 3.0 is built for vendor-agnostic orchestration, managing QoS policies across different switches and interfaces—ideal for hybrid or evolving AI network environments.

4. How do congestion notification packets get through when all queues are congested?

ONES allows you to designate a CNP (Congestion Notification Packet) queue with STRICT priority, ensuring that even during congestion, ECN messages reach the sender to throttle traffic.

5. Which scheduling algorithms does SONiC support for egress traffic?

SONiC supports DWRR, WRR, and STRICT scheduling, and ONES Fabric Manager lets you assign and orchestrate these policies via YAML—optimizing egress packet forwarding and queue handling in AI fabrics.


Streamlining AI Fabric Management: The Imperative of a Centralized Management Platform

Introduction

Artificial Intelligence (AI), once a mere buzzword, has now firmly established itself as a cornerstone of technological advancement. Its insatiable appetite for data fuels its continuous evolution, and generative AI, a subset capable of creating new content, is a prime driving force behind this growth. As datacenters become increasingly AI-centric and drive businesses worldwide, the networking community must assess their readiness for this transformative shift.

The Rapid Pace of AI Development

The pace of AI development is staggering, with years of progress potentially compressed into mere weeks. This rapid evolution necessitates a proactive approach from the networking community to ensure their solutions remain aligned with the cutting-edge advancements in AI. The challenge is multifold, as the increasing demand for networking switches and GPUs opens up opportunities for innovation in multi-vendor ecosystems and data center environments.

Fig 1 – GPU Market size and Trend

The Demand for Open and Flexible Networking Solutions

The rapid need for networking switches and GPUs has created a demand for multi-vendor ecosystems and data center environments. This increased demand for freedom from vendor lock-in has led to a surge of interest in open-source network operating systems (NOS) like SONiC for networking switches. The driving force behind this demand is the consolidation of features offered by multi-vendor hardware suitable for AI fabrics, along with overall cost optimization.

Evolving Data Center Network Architectures

As data center network designs evolve from server-centric to GPU-centric architectures, the necessity for new networking topology designs such as fat-tree, dragonfly, and butterfly has become paramount. GPU workloads, including training, fine-tuning, and inferencing, have distinct networking needs, with Remote Direct Memory Access (RDMA) being the most suitable technique to handle high-bandwidth data traffic flows. Lossless networking and low entropy are also essential for optimal performance.
Fig 2 – Evolution of Data Centers

The Need for Centralized Management Solutions

A single pane of glass management tool is essential to streamline operations and optimize performance in multi-vendor AI fabric data centers. Such a tool should be capable of:

Addressing the Challenges of Centralized Management with ONES

Implementing a centralized management tool in a multi-vendor AI fabric data center requires careful consideration of several key challenges:
Aviz understands this need and has implemented ONES 3.0, a centralized management platform that provides comprehensive control over networking devices, AI workload servers and data centers.
Fig 3 – Aviz Open Networking Enterprise Suite (ONES) for AI Fabrics

The Future of Networking in the AI Era

As AI continues to evolve and its applications expand, the networking community must adapt to the changing landscape. By embracing open-source solutions, adopting new network topologies, and leveraging centralized management platforms like ONES 3.0, organizations can ensure their networks are well-equipped to support the demands of AI-driven workloads. The future of networking is inextricably linked to the advancement of AI, and those who are proactive in their approach will be well-positioned to capitalize on the opportunities that lie ahead.

All these cutting-edge innovations only mark the initial stride towards Aviz Networks’ vision, and more is yet to come. With our strong team of support engineers, we are well-equipped to empower customers with a seamless SONiC journey using the ONES platform.

As AI-driven networks grow in complexity, a centralized management platform like ONES 3.0 by Aviz Networks is essential. It provides seamless control, real-time monitoring, and multi-vendor compatibility to tackle the unique demands of AI workloads. Future-proof your network with ONES 3.0—because the future of AI fabric management starts here.

Explore more about ONES 3.0 in our latest blogs here

If you wish to get in touch with me, feel free to connect on LinkedIn here

FAQs

1. Why is centralized management essential for AI Fabric networks?

Centralized management platforms like ONES 3.0 simplify multi-vendor orchestration, offer real-time GPU and network telemetry, and streamline configuration and monitoring for evolving AI data center topologies.

2. How does ONES 3.0 enable multi-vendor AI fabric management?

ONES 3.0 supports vendor-agnostic infrastructure, enabling seamless control across switches, NICs, and GPUs, while delivering lossless RDMA optimization, topology orchestration (fat-tree, dragonfly), and proactive alerting.

3. What are the top features of ONES 3.0 for AI fabrics?

Top features include:

  • Real-time infrastructure visualization
  • Multi-topology orchestration (fat-tree, dragonfly, butterfly)
  • GPU and NIC telemetry
  • Priority Flow Control (PFC)
  • End-to-end anomaly detection

4. Is ONES 3.0 suitable for GPU-based AI/ML workloads?

Yes, ONES 3.0 is optimized for AI/ML GPU workloads and RoCE-based RDMA traffic, enabling QoS profile automation, PFC watchdogs, and deep visibility into compute and network fabric.

5. Which network topologies does ONES 3.0 support?

ONES 3.0 supports fat-tree, dragonfly, and butterfly network topologies, enabling scalable, high-performance designs tailored to the latency and throughput needs of modern AI fabrics.

Categories
Open Networking Enterprise Suite

Announcing New Features in AI Network Management and Operations

We are thrilled to announce the release of ONES 3.0, a pivotal update in our ongoing innovation journey with Open Networking Enterprise Suite (ONES). This release furthers our mission of building ‘Networks for AI and AI for Networks,’ setting a new benchmark in network management and operations. With enhanced Visibility, AI Fabric Manager, and Support, ONES 3.0 is not merely an upgrade—it’s a significant stride forward. This version introduces advanced features that significantly boost the sophistication and efficiency of network operations, embodying our commitment to continuously push the boundaries of what’s possible in network orchestration and management.

"With the launch of ONES 3.0, we are enhancing the observability and orchestration of AI-Fabric network infrastructure tailored for GPU-centric workloads. This release offers improved visibility into compute metrics, including GPUs and network interface cards, enabling comprehensive end-to-end observability across multi-site AI infrastructures. Additionally, it strengthens fabric management with the inclusion of RoCE (QoS Profiles) configuration providing single-click Day 0 deployment for AI deployments. ONES 3.0 reflects our commitment to innovation, empowering customers to efficiently manage and optimize complex networks."

Key Features of ONES 3.0

ONES Multi-site

Multi-site offers a revolutionary way to visualize network anomalies across geographical locations. This intuitive, geospatial interface provides a comprehensive view of network health by representing anomalies on a map, making it easier to identify and address issues that span multiple sites. This feature is particularly valuable for organizations with geographically dispersed networks, as it allows for a unified and detailed perspective of network performance.

AI Fabric Manager

ONES AI Fabric Manager enhances the management and optimization of AI workloads, streamlining the deployment of AI/ML tasks across your network for efficient resource utilization. It automates the creation and assignment of QoS profiles, reducing the need for manual configuration.

Orchestration framework enables mapping of DSCP at Layer 3 and IEEE 802.1p at Layer 2 to traffic classes, which can then be linked to queues and priority groups. A key feature is Priority Flow Control (PFC), allowing users to define PFC-enabled queues for lossless traffic management. Additionally, a PFC watchdog can monitor functionality and initiate recovery actions if needed. The framework also supports ECN and various scheduling options, such as DWRR, WRR, and Strict Priority Scheduling for dynamic traffic management.
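
To make the mappings concrete, here is an illustrative sketch of the kind of QoS profile such an orchestration framework produces. The table and field names follow SONiC's config_db conventions (DSCP_TO_TC_MAP, TC_TO_QUEUE_MAP, PORT_QOS_MAP with pfc_enable); the helper function itself is hypothetical and is not the ONES API.

```python
# Illustrative only: builds a SONiC-style config_db QoS fragment.
# ONES generates profiles like this automatically; this sketch just
# shows the DSCP -> traffic class -> queue -> PFC chain.

def build_roce_qos_profile(lossless_tcs=(3, 4)):
    """Map DSCP values to traffic classes, classes to queues,
    and enable PFC on the lossless (RoCE) queues."""
    dscp_to_tc = {str(dscp): str(dscp // 8) for dscp in range(64)}
    tc_to_queue = {str(tc): str(tc) for tc in range(8)}
    return {
        "DSCP_TO_TC_MAP": {"AZURE": dscp_to_tc},      # Layer 3 classification
        "TC_TO_QUEUE_MAP": {"AZURE": tc_to_queue},    # class -> egress queue
        "PORT_QOS_MAP": {
            "Ethernet0": {
                "dscp_to_tc_map": "AZURE",
                "tc_to_queue_map": "AZURE",
                # PFC enabled only on the lossless queues
                "pfc_enable": ",".join(str(tc) for tc in lossless_tcs),
            }
        },
    }

profile = build_roce_qos_profile()
```

A PFC watchdog and ECN thresholds would then be attached to the same queues; scheduling (DWRR, WRR, or strict priority) is configured per queue on top of this mapping.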

With AI Fabric visibility, administrators gain real-time insights into workload performance and resource utilization, facilitating proactive management. Detailed analytics help monitor trends, identify bottlenecks, and inform future capacity planning.

GPU and NIC Visibility

ONES 3.0 introduces a standout feature that enhances network performance by providing advanced visibility into GPU server metrics. Once integrated with the ONES platform, the ONES agent on the server enables real-time monitoring of key metrics across network interfaces, GPUs, CPUs, and system-wide parameters. It supports a wide range of hardware vendors and configurations, ensuring adaptability and comprehensive monitoring. This capability is particularly valuable for tracking real-time server data and accommodating complex AI/ML workloads, ensuring that your network can handle even the most demanding computational tasks efficiently.

ServiceNow Integration

Experience the powerful ONES Rule Engine and Alerts system, now integrated with ServiceNow ticketing. The ONES anomaly detection engine automatically reports issues, streamlining incident management. This integration connects ONES with your existing IT service management infrastructure, enhancing change control and overall network operations. The seamless integration simplifies the maintenance and optimization of your network environment.

Support Enhancements

ONES now offers enhanced customer support through a single-pane access to the tech support page and syslog, providing comprehensive support resources and troubleshooting tools. The Tech Support feature allows for the efficient collection of system information, logs, configuration data, core dumps, and other critical data needed to diagnose and resolve issues. A new enhancement enables users to filter messages based on severity levels, such as errors, warnings, or all messages. This feature helps operators quickly identify and prioritize critical alerts, streamlining the troubleshooting process and improving overall operational efficiency.

A New Era of Network Management

ONES 3.0 features a suite of innovative functionalities and an enhanced user interface. This release revolutionizes network orchestration and management, providing the tools and capabilities needed to stay ahead in an increasingly complex network landscape.

To explore the full potential of ONES 3.0 and discover how it can transform your network operations, visit us at Aviz Networks. Embark on your journey toward seamless network monitoring and orchestration today.

FAQs

1. What’s new in ONES 3.0 for AI network management?

ONES 3.0 introduces multi-site anomaly visualization, GPU and NIC telemetry, AI Fabric QoS orchestration, ServiceNow integration, and enhanced support tools—designed to streamline operations in AI-centric, multi-vendor networks.

2. How does ONES 3.0 improve visibility into GPU servers?

With agent-based telemetry, ONES 3.0 provides real-time GPU, CPU, NIC, and system-level metrics, enabling proactive monitoring of AI workloads across diverse hardware and vendors.

3. Does ONES 3.0 support multi-site monitoring?

Yes, ONES 3.0 includes geospatial multi-site anomaly detection, allowing operators to monitor network health and troubleshoot issues across multiple geographic data centers in a single pane.

4. How does ONES 3.0 manage QoS for lossless AI fabrics?

ONES 3.0 automates DSCP-to-queue mapping, Priority Flow Control (PFC), ECN, and scheduling policies like WRR, DWRR, and strict priority—ideal for RoCE-based, lossless AI fabric environments.

5. Does ONES 3.0 integrate with ITSM tools?

ONES 3.0 features native ServiceNow ticketing integration, enabling automatic issue creation via its rule engine—bridging network telemetry with ITSM workflows for faster incident resolution.

Categories
Open Networking Enterprise Suite SONiC

The Power of Choice in Networking: How The AI Stack Breaks Down Barriers

A lot of people ask me, “What are the problems that you are solving for customers?” At Aviz, we understand that modern networking demands more than just connectivity; it requires agile, scalable solutions that can adapt to the evolving demands of AI-driven environments. We’re tackling the challenges of complexity, vendor lock-in, and prohibitive costs that many face in traditional network setups. Our AI Networking Stack isn’t just about keeping your network running; it’s about advancing it to think, predict, and operate more efficiently.

At Aviz, we are reshaping networks for the AI era by pioneering both ‘Networks for AI’ and ‘AI for Networks’. Our AI Networking Stack offers unparalleled choice, control, and cost savings, designed to enhance orchestration, observability, and real-time alerts in a vendor-agnostic environment. We’re not just providing solutions; we’re transforming networks with advanced LLM-based learning for critical operations, ensuring powerful, open-source solutions that drive innovation at a fraction of the cost.

We lead the journey to redefine networking with a data-centric approach that seamlessly integrates with any switch and network operating system, delivering performance that rivals the top OEM solutions—all while focusing on the core pillars of choice, control, and cost-effectiveness.

So, if you value having choices, staying in control, and achieving cost savings, read on to discover how our innovative solutions can transform your network management experience.

Now, let’s take a closer look at what sets our technology apart. Here is the detailed overview of our AI Networking Stack:

We’ve meticulously developed each layer of our AI Networking Stack to address the unique challenges our customers face in today’s dynamic network environments. From foundational hardware choices to advanced AI-driven functionalities, let’s dive into the specifics of what makes our technology stand out in the industry.

First, let’s discuss why choosing a vendor-agnostic approach is so crucial. Imagine using Linux: does it really matter whether you run it on HPE, Dell, or Lenovo servers, or in AWS, Azure, or GCP? That’s the kind of interoperability we bring to the networking world. Just as Linux did for the tech industry, we leverage the open-source SONiC operating system, enhanced by strong partnerships and robust community support. This approach offers an array of choices, enabling hardware selection from our partner ecosystem without any constraints.

At the heart of our innovative lineup is ONES (Open Networking Enterprise Suite), which empowers you with real-time visibility, seamless orchestration, advanced anomaly detection, and AI fabric functionality, including RoCE, across multiple vendors.
This means you have full control over which hardware solutions you implement, supported by our dedicated 24/7 customer service. ONES is designed to give you the freedom to manage your network without vendor lock-in, ensuring flexibility and control in your hands. Another critical aspect of our strategy to ensure cost efficiency is the Open Packet Broker (OPB).
Built on the powerful, community-driven SONiC platform, the OPB mirrors the capabilities of traditional Network Packet Brokers (NPB) but at a fraction of the cost. This solution delivers all the traditional functionalities you expect but optimizes them to offer significant cost savings without sacrificing performance or scalability.
Sitting atop our stack is the GenAI-based Network Copilot, your AI-powered assistant that simplifies all aspects of network management—from routine upgrades and audit reports to complex troubleshooting tasks. This tool is designed to enhance your operational efficiency, dramatically reducing the time and effort required for network management tasks, thereby freeing up your team to focus on strategic initiatives that drive business growth.

Our AI Networking Stack is designed to be the backbone of future network management, integrating advanced AI to navigate the complexities of modern networking with sophistication and ease. Opting for a vendor-agnostic approach provides the flexibility to choose the best technologies at the most effective prices, ensuring your network remains robust, scalable, and primed for future technological advancements.

Explore the benefits of a networking solution that brings choices, control, and cost savings without the constraints of traditional vendor lock-ins. This approach is not just about adopting new technology—it’s about advancing with a platform that understands and adapts to the evolving needs of your enterprise.

FAQs

1. What is an AI Networking Stack and why is it important for modern enterprises?

The AI Networking Stack refers to a multi-layered, AI-powered network management architecture that provides real-time orchestration, anomaly detection, and observability across multivendor environments. Aviz AI Stack combines ONES, Open Packet Broker (OPB), and Network Copilot™ to deliver flexibility, vendor-agnostic integration, and cost-efficiency—ideal for businesses modernizing their networks for AI-era workloads.

2. How does Aviz’s approach differ from traditional OEM networking?

Unlike traditional OEM networks that tie users into proprietary hardware and software, Aviz’s open-source SONiC-based stack supports interoperability across Dell, HPE, Cisco, Arista, and others. This vendor-agnostic approach lets enterprises avoid lock-in, reduce costs, and adapt faster to evolving infrastructure needs.

3. What components make up the Aviz AI Networking Stack?

Aviz’s stack includes:

  • ONES for orchestration, visibility, and AI fabric monitoring
  • Open Packet Broker (OPB) for cost-efficient traffic visibility
  • Network Copilot™ for AI-powered network insights and automation

These tools work together to deliver full-stack AI networking without dependency on any specific NOS or hardware.

4. What does Network Copilot™ do?

Network Copilot™ leverages LLM-based AI to provide intelligent chat-based troubleshooting, performance diagnostics, upgrade checks, and real-time analytics. It streamlines operations and replaces repetitive tasks with intelligent automation—making it ideal for both network engineers and business leaders.

5. What role does SONiC play in the stack?

SONiC (Software for Open Networking in the Cloud) enables hardware-agnostic deployment, similar to how Linux enabled OS flexibility. Aviz builds on SONiC to offer full-stack solutions (ONES, OPB, Copilot) that allow organizations to choose hardware freely, while retaining full control over orchestration, observability, and AI integration.

Categories
Open Networking Enterprise Suite SONiC

ONE Data Lake & AWS S3 – Enhancing data Management and Analytics – Part 2

In February, we introduced the ONE Data Lake as part of our ONES 2.1 release, highlighting its integration capabilities with Splunk and AWS. In this blog post, we’ll delve into how the Data Lake integrates specifically with the S3 bucket of AWS.

A data lake functions as a centralized repository designed to store vast amounts of structured, semi-structured, and unstructured data on a large scale. These repositories are typically constructed using scalable, distributed, cloud-based storage systems such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. A key advantage of a data lake is its ability to manage large volumes of data from various sources, providing a unified storage solution that facilitates data exploration, analytics, and informed decision-making.

Aviz ONE-Data Lake acts as a platform that enables the migration of on-premises network data to cloud storage. It includes metrics that capture operational data across the network’s control plane, data plane, system, platform, and traffic. As an enhanced version of the Aviz Open Networking Enterprise Suite (ONES), ONE-Data Lake stores the metrics previously used in ONES in the cloud.

Why AWS S3?

Amazon S3 (Simple Storage Service) is often used as a core component of a data lake architecture, where it stores structured, semi-structured, and unstructured data. This enables comprehensive data analytics and exploration across diverse data sources. S3 is widely used for several reasons:
S3 integrates seamlessly with a wide range of AWS services and third-party tools, significantly enhancing data processing, analytics, and machine learning workflows. This seamless integration allows for efficient data ingestion, real-time analytics, advanced data processing, and robust machine learning model training and deployment, creating a powerful and cohesive ecosystem for comprehensive data management and analysis.
S3 is engineered for complete durability, ensuring that your data is exceptionally safe and consistently accessible. This level of durability is achieved through advanced data replication across multiple geographically dispersed locations, providing robust protection against data loss and guaranteeing high availability.
S3 offers comprehensive security and compliance capabilities, providing a robust framework for safeguarding data and ensuring regulatory adherence. This includes advanced data encryption, both at rest and in transit, ensuring that sensitive information remains protected throughout its lifecycle. Additionally, S3 provides granular access management tools, such as AWS Identity and Access Management (IAM), bucket policies, and access control lists (ACLs), allowing fine-tuned control over who can access and modify data. These features, combined with compliance certifications for various industry standards (such as GDPR, HIPAA, and SOC), make S3 a secure and reliable choice for data storage in highly regulated environments.
S3’s capability to handle virtually unlimited amounts of data makes it an unparalleled choice for building and maintaining expansive data lakes that require storing massive volumes of information. This scalability empowers organizations to seamlessly scale their storage needs without upfront investments in infrastructure, accommodating growing data demands effortlessly. This capability is crucial for enterprises seeking to centralize and manage diverse data types, enabling advanced analytics, machine learning, and other data-driven initiatives with agility and reliability.
S3 provides flexible pricing models and a variety of storage classes to optimize costs based on data access patterns. Users can take advantage of storage classes like S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One Zone-IA, and S3 Glacier to manage expenses efficiently.
S3 offers robust data management capabilities, including versioning, lifecycle policies, and replication, which streamline data governance and archival processes. These features ensure data integrity, compliance, and resilience across various use cases, from regulatory compliance to disaster recovery planning. This capability empowers businesses to unlock the full potential of their data assets, supporting diverse applications such as predictive analytics, business intelligence, and real-time reporting with ease and efficiency.
S3’s robust features, including cross-region replication and lifecycle policies, establish it as an exceptional solution for disaster recovery strategies, ensuring data redundancy and resilience. Furthermore, S3’s lifecycle policies enable automated management of data throughout its lifecycle, facilitating seamless transitions between storage tiers and automated deletion of outdated or unnecessary data. Together, these features make S3 a reliable backup solution that enhances data durability and availability, providing organizations with peace of mind knowing their critical data is securely stored and accessible even in unforeseen circumstances.
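
As a concrete illustration of the lifecycle policies mentioned above, the sketch below builds a minimal lifecycle configuration in the schema that boto3's `put_bucket_lifecycle_configuration` accepts. The bucket name, prefix, and retention periods are hypothetical examples, not ONES defaults.

```python
# Hypothetical lifecycle policy for telemetry stored in S3: tier data to
# cheaper storage classes over time, then expire it after a year.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-ones-metrics",
            "Status": "Enabled",
            "Filter": {"Prefix": "ones-metrics/"},     # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},      # archival
            ],
            "Expiration": {"Days": 365},  # delete after one year
        }
    ]
}

# With boto3, this would be applied as:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ones-datalake", LifecycleConfiguration=lifecycle)
```
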

Integrating S3 with ONES:

To integrate the S3 service with ONES, supply the AWS account details (ARN role, region, bucket name, and, optionally, an external ID) in the cloud instance configuration page. By providing these details accurately, you can effectively configure and integrate the S3 service with ONES, facilitating smooth metric collection and analysis.
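
The sketch below illustrates the settings involved. The field names and validation helper are hypothetical, not ONES code; they simply mirror the details the configuration page asks for.

```python
# Illustrative only: the settings an S3 cloud instance requires and a
# minimal check that the mandatory fields are present.
REQUIRED_FIELDS = ("role_arn", "region", "bucket")

def validate_s3_settings(settings: dict) -> list:
    """Return the list of missing required fields (empty means valid)."""
    return [f for f in REQUIRED_FIELDS if not settings.get(f)]

settings = {
    "role_arn": "arn:aws:iam::123456789012:role/ones-datalake",  # hypothetical
    "region": "us-east-1",
    "bucket": "ones-metrics",
    "external_id": None,  # optional
}
missing = validate_s3_settings(settings)
```
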
Figure 1: Cloud Instance configuration page in ONES
Figure 2: Instance created and ready for data streaming
The cloud instance created within ONES offers several management options to enhance user experience and sustainability. Users can update the integration settings, pause and resume metric uploads to the cloud, and delete the created integration when needed. These features make it easy for users to maintain and manage their cloud endpoint integrations effectively.
Figure 3 : Updating the integration details
Figure 4: Option to pause and resume the metric streaming to cloud
Figure 5: Option to delete the integration created
The end user has the flexibility to select which metrics from their network monitored by ONES should be uploaded to the designated cloud service. This ONES 2.1 release supports various metrics, including Traffic Statistics, ASIC Capacity, Device Health, and Inventory. Administrators can choose and deselect metrics from the available list within these categories according to their preferences.
Figure 6 : Multiple options available for metric update on cloud
The metric update is not limited to any particular hardware or network operating system (NOS). ONE-Data Lake’s data collection capability extends across various network operating systems, including Cisco NX-OS, Arista EOS, SONiC, and non-SONiC platforms. Data streaming occurs via the gNMI process on SONiC-supported devices and through SNMP on other vendors’ operating systems.
Figure 7: ONES inventory showing multiple vendor devices streaming

S3 Analytical capabilities:

Analyzing data stored in an S3 bucket can be accomplished through various methods, each leveraging different AWS services and tools. Here are some key methods:

AWS Athena:

Description: A serverless interactive query service that allows you to run SQL queries directly against data stored in S3.

Use Case: Ad-hoc querying, data exploration, and reporting.

Example: Querying log files, CSVs, JSON, or Parquet files stored in S3 without setting up a database.
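
As a sketch of such an ad-hoc query, the snippet below builds a standard Athena SQL statement over metrics stored as JSON in S3. The table and column names are hypothetical; the commented boto3 call shows how the query would actually be submitted.

```python
# Illustrative only: count devices per NOS vendor from a hypothetical
# Athena table backed by the S3 bucket.
def vendor_count_query(table="ones_metrics"):
    return (
        f"SELECT nos_vendor, COUNT(*) AS devices "
        f"FROM {table} "
        f"GROUP BY nos_vendor ORDER BY devices DESC"
    )

query = vendor_count_query()

# With boto3 the query would be submitted as:
# import boto3
# boto3.client("athena").start_query_execution(
#     QueryString=query,
#     QueryExecutionContext={"Database": "ones_datalake"},
#     ResultConfiguration={"OutputLocation": "s3://ones-athena-results/"})
```
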

Figure 8 - Data stored in S3 bucket in JSON format

AWS Glue:

Description: A managed ETL (Extract, Transform, Load) service that helps prepare and transform data for analytics.

Use Case: Data preparation, cleaning, and transformation.

Example: Cleaning raw data stored in S3 and transforming it into a more structured format for analysis.

Figure 9 - Pie Chart in S3 representing the data from different NOS vendors

AWS SageMaker:

Description: A fully managed service for building, training, and deploying machine learning models.

Use Case: Machine learning and predictive analytics.

Example: Training machine learning models using large datasets stored in S3 and deploying them for inference.

Third-Party Tools:

Description: Numerous third-party tools integrate with S3 to provide additional analytical capabilities.

Use Case: Specialized data analysis, data science, and machine learning.

Example: Using tools like Databricks, Snowflake, or Domo to analyze and visualize data stored in S3.

Custom Applications:

Description: Developing custom applications or scripts that use AWS SDKs to interact with S3.

Use Case: Tailored data processing and analysis.

Example: Writing Python scripts using the Boto3 library to process data in S3 and generate reports.
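
A minimal sketch of that pattern follows, assuming inventory records exported to S3 in JSON-lines form with a hypothetical "nos" field. With Boto3 the lines would be read from `s3.get_object(...)["Body"]`; here a local string stands in so the report logic is self-contained.

```python
import json
from collections import Counter

def vendor_report(json_lines: str) -> dict:
    """Count devices per NOS vendor from JSON-lines inventory data."""
    counts = Counter(
        json.loads(line)["nos"]
        for line in json_lines.splitlines()
        if line.strip()
    )
    return dict(counts)

# Stand-in for data downloaded from the S3 bucket.
sample = "\n".join([
    '{"device": "leaf1", "nos": "SONiC"}',
    '{"device": "leaf2", "nos": "SONiC"}',
    '{"device": "core1", "nos": "NX-OS"}',
])
report = vendor_report(sample)  # {'SONiC': 2, 'NX-OS': 1}
```
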

Conclusion:

Aviz ONE-Data Lake serves as the cloud-native iteration of ONES, facilitating the storage of network data in cloud repositories. It operates agnostically across various cloud platforms and facilitates data streaming from major network device manufacturers like Dell, Mellanox, Arista, and Cisco. Network administrators retain flexibility to define which metrics are transferred to the cloud endpoint, ensuring customized control over the data storage process.

Unlock the ONE-Data Lake experience— schedule a demo on your preferred date, and let us show you how it’s done!

FAQs

1. What are the benefits of integrating ONE Data Lake with AWS S3 for network data storage?

Integrating Aviz ONE Data Lake with AWS S3 enables:

  • Centralized cloud storage for network telemetry
  • Unlimited scalability for growing datasets
  • Enhanced security with AWS encryption and IAM controls
  • Durable and highly available storage across regions

  • Flexible analytics through services like AWS Athena, Glue, and SageMaker

This combination helps enterprises achieve cost-effective, compliant, and powerful data management.

2. How do I set up the AWS S3 integration with ONE Data Lake?

To set up AWS S3 integration with ONE Data Lake:

  • Provide your ARN role, region, bucket name, and (optionally) external ID
  • Configure your S3 instance on the ONES cloud interface

  • Select desired network metrics (e.g., Traffic Stats, Device Health) for uploading

This ensures seamless cloud metric collection customized to your organization’s needs.

3. Which metrics can be uploaded to the cloud?

With ONES 2.1, administrators can selectively upload metrics like:

  • Traffic Statistics
  • ASIC Capacity Metrics
  • Device Health and Platform Monitoring
  • Inventory Data

The flexibility to customize and filter metrics helps optimize storage costs and streamline analytics pipelines.

4. Does ONE Data Lake support multi-vendor devices?

Yes! Aviz ONE Data Lake collects and streams telemetry across multiple NOS platforms, including:

  • Cisco NX-OS
  • Arista EOS
  • SONiC
  • Cumulus Linux and other non-SONiC devices

It uses gNMI for SONiC and SNMP for other vendors, ensuring multi-vendor support without limitations.

5. How can AWS services help analyze the data stored in S3?

  • AWS Athena enables SQL-based querying directly on raw S3 data (no database setup needed).
  • AWS Glue automates ETL workflows, prepping raw network telemetry for structured analytics.
  • AWS SageMaker builds ML models using S3-stored datasets for predictive network optimization.

Together, these services transform raw network data into actionable insights and machine learning opportunities.

Categories
Open Networking Enterprise Suite SONiC

ONE Data Lake & Splunk: Revolutionizing Network Data Analytics – Part 1

In February, we introduced the ONE Data Lake as part of our ONES 2.1 release, highlighting its integration capabilities with Splunk and AWS. In this blog post, we’ll delve into how the Data Lake integrates specifically with Splunk.

A data lake serves as a centralized storage facility capable of accommodating large quantities of structured, semi-structured, and unstructured data at significant scale. These are typically built using scalable, distributed, cloud-based storage systems such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage.

A pivotal benefit of a data lake lies in its capacity to handle substantial amounts of data from diverse origins, offering a cohesive storage solution conducive to data exploration, analytics, and informed decision-making processes.

Aviz ONE-Data Lake functions as a platform facilitating the migration of on-premises network data to cloud storage. It encompasses metrics that capture operational data across the network’s control plane, data plane, system, platform, and traffic. Serving as an upgraded iteration of Aviz Open Networking Enterprise Suite (ONES), ONE-Data Lake stores the metrics previously utilized in ONES onto the cloud.

Why Splunk?

Splunk is highly significant for organizations across diverse industries for multiple reasons:
Splunk empowers organizations to obtain immediate insights from their operational data, facilitating the monitoring of system and application health and performance. This capability aids in promptly identifying and addressing issues, thereby reducing downtime and enhancing operational efficiency.
Splunk is extensively utilized for Security Information and Event Management (SIEM) objectives, aiding organizations in overseeing their IT environments for security threats and irregularities. By correlating data from diverse sources, it can efficiently identify and address security incidents, thereby bolstering the overall cybersecurity stance.
Splunk supports regulatory adherence and oversight by empowering organizations to gather, analyze, and report on data pertinent to regulatory requirements and industry standards. This capability is especially critical for sectors like finance, healthcare, and government, where stringent compliance mandates are in place.
Splunk aids in IT operations and DevOps practices by providing visibility into IT infrastructure, application performance, and deployment processes. This allows organizations to identify areas for optimization, streamline operations, and accelerate the development and delivery of software applications
Splunk equips organizations with machine learning and predictive analytics functionalities, empowering them to uncover patterns, detect anomalies, and forecast outcomes from their data. This supports proactive resolution of issues, capacity planning, and efforts in risk management

Splunk can be utilized to assess customer interactions and feedback from various channels, enabling organizations to delve deeper into customer requirements and preferences. This information can then be utilized to tailor offerings, elevate customer satisfaction levels, and nurture brand loyalty

To sum up, Splunk is an essential tool for organizations to leverage data efficiently, promoting operational excellence, strengthening security measures, ensuring compliance, and achieving business objectives.

Integrating Splunk with ONES:

To integrate the Splunk service with ONES, supply the required Splunk connection details in the cloud instance configuration page. By ensuring these details are accurately provided, you can successfully configure and integrate the Splunk service with ONES, enabling seamless metric collection and analysis.
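
To illustrate how a metric record typically reaches Splunk, the sketch below builds a payload in the standard Splunk HTTP Event Collector (HEC) shape. The host, token, index, and field names are hypothetical, and ONES performs this streaming internally; the sketch only shows the wire format.

```python
# Illustrative only: a Splunk HEC event payload for one telemetry record.
def hec_payload(metric: dict, index="ones_metrics", sourcetype="ones:telemetry"):
    """Wrap a metric record in the envelope the HEC endpoint expects."""
    return {"index": index, "sourcetype": sourcetype, "event": metric}

payload = hec_payload({"device": "leaf1", "if_in_octets": 123456})

# With the requests library this would be posted as:
# import json, requests
# requests.post("https://splunk.example.com:8088/services/collector/event",
#               headers={"Authorization": "Splunk <hec-token>"},
#               data=json.dumps(payload))
```
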
Figure 1: Cloud Instance configuration page in ONES
Figure 2: Instance created and ready for data streaming
The cloud instance created within ONES offers several management options to enhance user experience and sustainability. Users can update the integration settings, pause and resume metric uploads to the cloud, and delete the created integration when needed. These features make it easy for users to maintain and manage their cloud endpoint integrations effectively.
Figure 3 : Updating the integration details
Figure 4: Option to pause and resume the metric streaming to cloud
Figure 5: Option to delete the integration created
The end user has the flexibility to select which metrics from their network monitored by ONES should be uploaded to the designated cloud service. This ONES 2.1 release supports various metrics, including Traffic Statistics, ASIC Capacity, Device Health, and Inventory. Administrators can choose and deselect metrics from the available list within these categories according to their preferences.
Figure 6 : Multiple options available for metric update on cloud
The metric update is not limited to any particular hardware or network operating system (NOS). ONE-Data Lake’s data collection capability extends across various network operating systems, including Cisco NX-OS, Arista EOS, SONiC, and non-SONiC platforms. Data streaming occurs via the gNMI process on SONiC-supported devices and through SNMP on other vendors’ operating systems.
Figure 7: ONES inventory showing multiple vendor devices streaming

Splunk Analytical capabilities:

Events within Splunk generally contain timestamped data alongside related metadata and content. Each event is parsed and indexed separately, enabling users to efficiently search, analyze, and visualize data. Splunk automatically extracts fields from events during indexing, streamlining filtering and correlation based on specific criteria.
Figure 8 - Inventory details from NX-OS is captured as events in Splunk
This entails visually depicting data using charts or graphs, helping users comprehend patterns, trends, and relationships within the data more readily than analyzing raw data alone. These graphical representations encompass diverse types such as bar charts, line charts, pie charts, and scatter plots, each tailored to specific data types and analytical objectives.
Figure 9 - Pie Chart in Splunk representing the data from different NOS vendors

Conclusion:

Aviz ONE-Data Lake functions as the cloud-based version of ONES, enabling the storage of network data in cloud repositories. It operates independently of any particular cloud platform and supports data streaming from leading network device manufacturers such as Dell, Mellanox, Arista, and Cisco. Network administrators have the freedom to specify the metrics they want to transfer to the cloud endpoint, granting customized control over the data storage procedure.

Schedule your demo today because with ONE Data Lake integrated with Splunk, you’re not just managing data — you’re revolutionizing network analytics for unparalleled insights and efficiency.

FAQs

1. What is Aviz ONE Data Lake and how does it enhance network data analytics?

 Aviz ONE Data Lake is a cloud platform that collects and stores telemetry from multi-vendor networks. It centralizes operational, traffic, and device health data, giving teams a single source for deep analytics, proactive decisions, and smarter network management.

2. What are the benefits of integrating ONE Data Lake with Splunk?

Connecting ONE Data Lake with Splunk gives teams:

  • Real-time analytics
  • Powerful dashboards
  • Anomaly detection

The integration helps detect issues faster, optimize resources, strengthen security, and improve operational visibility across all network layers.

3. What types of network data can be streamed to the cloud?

You can stream:

  • Traffic statistics
  • ASIC utilization
  • Device health
  • Network inventory

These stream from SONiC and non-SONiC devices such as Cisco, Arista, and Dell, using the gNMI and SNMP protocols.

4. Is ONE Data Lake tied to a specific vendor or network operating system?

No. ONE Data Lake is vendor-neutral and supports multi-vendor environments, including SONiC, Cisco NX-OS, Arista EOS, and more, so you get unified observability across your entire network.

5. How do users manage the Data Lake integration?

Through ONES, users can:

  • Update integration settings
  • Pause or resume uploads
  • Select which metrics to send
  • Delete integrations when needed

It's simple, flexible, and designed for dynamic network environments.

Categories
Open Networking Enterprise Suite

From Hype to Reality: Navigating the Challenges of AI in Network Telemetry

AI is riding the crest of a technological wave, crowned the “Peak of Inflated Expectations” by Gartner’s 2023 Hype Cycle. Platforms like ChatGPT have become more than just buzzwords; they’re blazing a trail into a new era of technological possibilities. This isn’t a fleeting fad; it’s a fuel injection for innovation, poised to transform the landscape across industries including the Networking domain.

Think beyond chatbots and clever tweets. AI's true potential lies in its ability to learn, adapt, and create. It can craft personalized experiences, generate realistic synthetic data, and even write code, all while pushing the boundaries of what we thought possible. This isn't just about hype; it's about harnessing the power of creativity to revolutionize the way we live, work, and play. So buckle up, because the AI revolution is just getting started. In this blog, let me share some insights into how AI can transform network telemetry and enhance the experience.

Gartner Hype Cycle - AI

Understanding Network Telemetry and applying AI

What is Network Telemetry?

Network Telemetry is the process of collecting, inspecting, normalizing, and interpreting data to generate information that helps the end user visualize the network state and make decisions.

Beyond simply collecting data, network telemetry transforms it into actionable intelligence. Through meticulous analysis and normalization, it illuminates the network’s current state, enabling informed decisions and proactive interventions. Think of it as the network’s nervous system, providing a constant pulse of information for precise navigation.
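As a concrete illustration of that pipeline, the sketch below normalizes raw counter samples from two collector styles into one common record shape. The SNMP names come from the standard IF-MIB and the gNMI leaf from OpenConfig, but the output record fields are invented for the example:

```python
# Toy normalization step: raw samples arrive in collector-specific
# shapes and are mapped to one common record (output fields invented).

def normalize(sample):
    if "ifHCInOctets" in sample:        # SNMP IF-MIB style counter
        return {"port": sample["ifIndex"],
                "rx_bytes": sample["ifHCInOctets"]}
    if "in-octets" in sample:           # gNMI/OpenConfig style leaf
        return {"port": sample["interface"],
                "rx_bytes": sample["in-octets"]}
    raise ValueError("unknown sample format")

raw = [
    {"ifIndex": 1, "ifHCInOctets": 5_000},
    {"interface": "Ethernet0", "in-octets": 7_500},
]
normalized = [normalize(s) for s in raw]
```

Once every sample shares one shape, the interpretation stage (and any AI model behind it) only has to reason about a single schema.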

Harnessing the Power of AI for Network Telemetry

The convergence of AI and network telemetry represents a significant evolutionary leap in network management. By integrating AI’s analytical prowess with established telemetry infrastructure, we can unlock transformative benefits that enhance network security, optimize resource allocation, and streamline troubleshooting.

Elevating Network Intelligence:

Beyond Hype, Embracing a Paradigm Shift:

The integration of AI into network telemetry isn’t just a technological trend; it’s a strategic imperative. By embracing this transformative technology, organizations can build a future-proof network infrastructure characterized by enhanced security, proactive efficiency, and informed decision-making. This is not a revolution, but an evolution, a seamless integration of AI’s capabilities to empower existing systems and propel network management to new heights.

Reframing the Challenges: Building Robust AI for Network Telemetry

While the promises of AI in network telemetry are vast, navigating its implementation requires careful consideration of several key challenges:

Data-Driven Foundations:

Trust and Transparency:

AI TRISM: Transforming Network Telemetry with Trust, Reliability, and Safety

Applying the AI TRISM framework to network telemetry unlocks a new era of trust, reliability, and safety in our connected world. Trust is bolstered by transparent models that explain how anomalies are detected and prioritized, allowing network administrators to understand and make informed decisions. Reliability soars through AI-powered anomaly detection, automatically pinpointing issues before they snowball into outages, while synthetic data generation ensures robust training even with limited real-world telemetry. Safety takes center stage as AI models learn to differentiate between harmless fluctuations and genuine threats, protecting critical infrastructure from cyberattacks and malicious actors.

Imagine a network humming with the silent symphony of AI. Anomalous blips in traffic flow are instantly flagged, not by rigid thresholds, but by AI models continuously learning the network’s healthy rhythm. Security threats are swiftly identified and neutralized, not through brute force, but by AI’s uncanny ability to discern friend from foe. This is the future of network telemetry, powered by AI TRISM – a future where trust, reliability, and safety weave a protective web around our increasingly interconnected lives.
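Learning the network's healthy rhythm instead of fixing rigid thresholds can be as simple as a rolling statistical baseline. The sketch below flags a sample whose z-score against recent history exceeds a bound; real systems use far richer models, and the traffic numbers are purely illustrative:

```python
import statistics

def is_anomalous(history, sample, z_limit=3.0):
    """Flag a sample deviating more than z_limit standard
    deviations from the mean of the recent history window."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return sample != mean       # flat baseline: any change stands out
    return abs(sample - mean) / stdev > z_limit

# Illustrative traffic baseline (Mbps) with a sudden spike.
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
print(is_anomalous(baseline, 104))   # within the normal rhythm
print(is_anomalous(baseline, 180))   # spike stands out
```

The threshold here adapts automatically as the history window shifts, which is the simplest form of the "continuously learning" behavior described above.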

We, at Aviz, are harnessing the power of AI to make significant improvements in the networking landscape. Expect even more advancements to come from us soon.

Contact us today because with our cutting-edge AI solutions, you’re not just navigating the hype — you’re transforming your network telemetry into a powerhouse of innovation, efficiency, and security.

FAQs

1. How can AI transform traditional network telemetry and observability?

AI enhances traditional network telemetry by enabling real-time anomaly detection, automated root cause analysis, predictive traffic forecasting, and proactive infrastructure optimization. It transforms telemetry from passive data collection into dynamic, actionable intelligence that improves security, resilience, and operational efficiency.

2. What are the key benefits of applying AI to network telemetry?

Key benefits include advanced anomaly detection, faster troubleshooting through automated diagnostics, improved capacity planning with predictive analytics, proactive threat identification, and real-time, human-readable insights delivered via AI-powered chatbots and conversational interfaces for network teams.

3. What challenges arise when implementing AI in network telemetry?

Challenges include ensuring high-quality, diverse telemetry data, selecting and adapting the right AI models, building explainable and transparent systems, preventing AI “hallucinations” or false outputs, and continuously training models to align with evolving network topologies and threat landscapes.

4. How does the AI TRISM framework improve network telemetry?

AI TRISM (Trust, Reliability, and Safety Management) improves network telemetry by enforcing transparency, reliable anomaly detection, and safe behavior prediction. It ensures AI models are explainable, resilient against adversarial inputs, and capable of differentiating real threats from harmless fluctuations.

5. Why does explainability matter in AI-driven network telemetry?

Explainability ensures that network teams can trust and understand AI-driven insights, making it easier to justify actions, detect false positives, and continuously improve model performance. Transparent AI builds operational confidence and fosters responsible decision-making in critical network environments.
