### Ilona Gabinsky [00:00:07]
My name is Ilona Gabinsky, and this is the Aviz podcast series. Today we will be talking about Networks for AI and AI for Networks with NVIDIA and Aviz, and about the pivotal role of networking in enabling AI at scale, from training and inference to long-term operations. We have a very special guest, Taylor Allison from NVIDIA. Taylor is a Senior Product Marketing Manager responsible for marketing NVIDIA’s networking-for-AI platforms, including Spectrum-X Ethernet, Quantum InfiniBand, and NVLink. Taylor has a passion for product marketing and management in the accelerated infrastructure space, with particular expertise in AI and HPC. Taylor, welcome to the show.
### Ilona Gabinsky [00:00:58]
So, Taylor, you’ve been in tech for a while. What’s the biggest “wow” moment you’ve had watching networking evolve?
### Taylor Allison [00:01:08]
Yeah, thanks for having me here, Ilona. It’s a pleasure to be here. That’s a great question, and I can answer it enthusiastically. The biggest wow moment is happening right now. We are in an incredible time of such change, such dynamism, such adaptation to meet the needs of AI workloads. My background is in high-performance computing, so we had a sense of scale; HPC was computing at a scale that, at the time, had never been seen in any other enterprise application. But here we are in a world where hundreds of thousands of GPUs are being networked together at incredibly high speeds, with bleeding-edge techniques to really optimize performance. It feels like we’re living history right now, and I can’t imagine being in a better position. I look forward to telling my grandkids one day about where I am now, and the challenges and opportunities we encounter in this modern age. It’s incredible.
### Ilona Gabinsky [00:03:14]
Yeah, I absolutely, totally agree. And actually, I would love to hear what you’d say to your grandkids. It’s interesting to imagine what the future holds and how it all evolves, because we are clearly in a revolutionary moment, with this great new technology coming in. Where we will all be, where your grandkids will be, I would love to see that. Right now we can only predict or guess what will happen. But let me bring the conversation back to networking. Considering that everything really revolves around AI right now, why do you think the network is becoming such a critical foundation for AI training and inference?
### Taylor Allison [00:04:16]
Yeah, so great question. I’ll give the in-a-nutshell answer, a summary from our CEO, Jensen Huang, that I think really nails it: the data center is the new unit of compute. What that means is that AI is a data-center-scale challenge. To attain that scale, you have hundreds or thousands of servers containing accelerated compute infrastructure like GPUs. They’re all networked together, and they all have to talk to each other. And that’s really the challenge of AI: as a workload, it’s incredibly sensitive to that network communication, and it’s very highly synchronized. You have this world where every GPU has to talk to every other GPU. Each one does its own little piece of the puzzle, and when the computation is done, the communication has to happen. At that point, the gradients, the work done by each GPU, are sent over the network and propagated to every other GPU, and you can only move on to the next step of a training workload once each of those gradients has made its way to each of the GPUs.
So you’re sensitive not only to average performance but to worst-case performance. You need low latency because you have to wait for that last communication to finish before you can move on to the next step of the AI workload. That’s the scenario for training, but it’s really true for inference as well. Once you’ve trained your model and deployed it, you have queries coming in from tons of users at the same time. The model takes in each query, everything gets converted into tokens, it calculates the answer, and it sends it back. Each of those steps relies on the network, and latency remains just as critical there as it does in training.
So with both training and inference, you have this really critical requirement for high effective bandwidth and really low latency from your network. This is the backdrop for why NVIDIA created Spectrum-X, the world’s first Ethernet platform built for AI, and introduced it into the market. Alongside our Quantum InfiniBand platform, those two are what we consider the standard for scale-out AI workloads, where you’re communicating beyond servers, beyond the NVLink domain.
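To make the synchronization pattern Taylor describes concrete, here is a minimal sketch using PyTorch’s collective communication API. This is an illustrative example, not something discussed on the show: each rank (GPU) computes gradients locally, and a blocking all-reduce means the next step cannot begin until the slowest exchange finishes, which is why tail latency matters as much as the average.

```python
# Minimal sketch of data-parallel gradient synchronization.
# Assumes a process group has already been initialized, e.g.:
#   torch.distributed.init_process_group(backend="nccl")
import torch
import torch.distributed as dist

def training_step(model: torch.nn.Module, loss: torch.Tensor) -> None:
    loss.backward()  # each GPU computes gradients on its local data shard
    for param in model.parameters():
        if param.grad is None:
            continue
        # Blocking collective: every rank waits here until this gradient
        # has been summed across, and delivered to, every other rank.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= dist.get_world_size()  # average across GPUs
    # Only after the last all_reduce completes can the optimizer step and
    # the next iteration begin -- the "wait for the last communication"
    # effect described above.
```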
### Ilona Gabinsky [00:07:43]
Thank you very much for such a comprehensive answer, Taylor. At Aviz, we’ve been working very closely with NVIDIA to help customers orchestrate Spectrum-X fabrics across multiple tenants and life cycle phases. So from your point of view, what’s most important across Day 0, Day 1, and Day N when building and scaling AI infrastructure?
### Taylor Allison [00:08:08]
Yeah, so at the end of the day, you’re trying to get the most out of your AI infrastructure. If you’re doing training, you want to train your model as fast as possible and get the most utilization out of your infrastructure. If you’re doing inference, you want time to first token to be as fast as possible and token velocity to be as high as possible. In both cases, you want maximum performance, maximum uptime, maximum availability. With Day 0, you want to be up and ready to go by the time the hardware rolls in and is racked, stacked, and cabled. You want to do as much as you can ahead of time. At NVIDIA, we have a tool called NVIDIA Air, a cloud-based platform for creating digital twins of your network. You can use it to run a hardware-free PoC and test your automation, your security, any of those things you mentioned you’re orchestrating with ONES and the Aviz platforms. Once you understand what your protocols are going to look like, you can test them all in Air before anything gets to your data center.
With Day 1, once you’re deploying, you want everything to go smoothly. That’s where these orchestration tools come into play. Things like intent-based networking from Aviz’s offerings, combined with the tooling we have at NVIDIA, really accelerate the prep of your automation, security, monitoring, and management, so you get all your hardware into the data center and get up and running as fast as possible.
For ongoing operations with Day N, you want downtime to be as minimal as possible. You want any upgrades to be non-disruptive, so your AI factory remains operational and keeps producing intelligence via tokens. That requires a really smooth, seamless CI/CD pipeline. And again, that’s the power of Air: if you’re rolling out a new approach to your VXLAN, or changing parameters, you can test those configuration changes ahead of time and make the rollout seamless with orchestration platforms like ONES. This idea of Day 0, Day 1, Day N matters especially for multi-tenant environments, which are really what we’re all building toward, because multi-tenancy is the modern infrastructure we’re accustomed to. So the NVIDIA Air platform combined with ONES is a really powerful combination, and that’s how you achieve the highest ROI on your infrastructure.
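As an illustration of the Day-N gate Taylor describes, testing a configuration change against a digital twin before it reaches production, here is a self-contained sketch. Every function and data shape in it is hypothetical scaffolding invented for this example; it is not the NVIDIA Air or ONES API.

```python
# Hypothetical CI/CD pre-deployment gate: run validation checks against a
# digital twin of the fabric and only allow rollout if everything passes.
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool

def run_validation_suite(twin_state: dict) -> list[CheckResult]:
    """Stand-in for checks you would run on the twin: BGP session state,
    VTEP reachability after a VXLAN parameter change, MTU consistency, etc."""
    return [
        CheckResult("bgp_sessions_established", twin_state.get("bgp_up", False)),
        CheckResult("vxlan_vteps_reachable", twin_state.get("vteps_ok", False)),
    ]

def pre_deployment_gate(twin_state: dict) -> bool:
    """Gate the pipeline: proceed to production only if every check passes."""
    return all(check.passed for check in run_validation_suite(twin_state))

if __name__ == "__main__":
    # Candidate change applied to the twin, then validated before rollout.
    twin_after_change = {"bgp_up": True, "vteps_ok": True}
    print("safe to roll out:", pre_deployment_gate(twin_after_change))
```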
### Ilona Gabinsky [00:12:00]
This is great. Thank you, Taylor. As you know, at Aviz we help customers build networks for the AI era, as well as integrate AI into networks. So I’d like to shift the perspective a little from Networks for AI, on which you just gave such a comprehensive answer. If our listeners have any questions, please reach out to us and we’ll be happy to provide more detailed information on how we orchestrate Spectrum-X fabrics. But now I’d really like to shift from Networks for AI to AI for Networks. You know that we’ve been developing AI Network Copilot for use cases such as network audits, upgrades, troubleshooting, and others. It’s really an assistant for network engineers and architects, helping teams stay ahead as complexity grows and taking care of the mundane work they deal with every day. So what’s your vision for how AI can support day-to-day network operations in the future?
### Taylor Allison [00:13:15]
Yeah, great question and a super important topic. Even today, NetOps tools are vital. At a smaller scale, say you’ve got just a couple of network admins, they need the help; they need this tooling to handle the quite complex stack they’re given. At a larger scale, for clouds, I think it becomes even more important. Not to go back too much into networking for AI, but to touch on it briefly: this infrastructure is really being cloudified. We’re talking massive data centers. You want to get as much out of them as you can, and that probably involves a mixture of workloads in the environment, both training and inference, which again requires multi-tenancy. And you’re talking a scale where you have thousands, tens of thousands of network devices. You need all the help you can get from your tools.
And so I think tools like AI Network Copilot become even more important as time goes by. Things like anomaly detection and cybersecurity become vital as scale increases, because the workloads running on this infrastructure truly are mission-critical for the companies running them. This is how you make your money. This is how you provide for your customers. And the power of these tools is only growing: the AI models powering things like AI Network Copilot keep getting more sophisticated. It’s kind of a feedback loop between AI for networking and networking for AI. Going forward, the human in the loop, so to speak, your network admin, is going to remain incredibly important. If an optic or a transceiver fails, you need a person to go in there and swap it out; we’re not quite at the age where we’re all replaced by robots. But the requirements on the network admin are only growing, so enhancing their productivity by automating away the complex and the mundane with tools like AI Network Copilot is vital. It’s already important today, and it’s only going to get more important in the future.
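To show the flavor of the anomaly detection Taylor mentions, here is a deliberately simple sketch: flag a telemetry sample that deviates sharply from its recent baseline using a z-score. Real NetOps tools such as AI Network Copilot are far more sophisticated; this only illustrates the basic idea.

```python
# Toy anomaly detector for per-interface telemetry (e.g., CRC error counts).
from statistics import mean, stdev

def is_anomalous(history: list[float], sample: float, threshold: float = 3.0) -> bool:
    """Flag a sample more than `threshold` standard deviations from the
    recent baseline. `history` is the last N samples for one interface."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return sample != mu  # flat baseline: any change stands out
    return abs(sample - mu) / sigma > threshold

# A sudden spike in errors on an otherwise quiet link gets flagged.
print(is_anomalous([2, 3, 2, 4, 3, 2], 40))  # True
```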
### Ilona Gabinsky [00:16:18]
Thank you very much, Taylor. That’s been a great conversation. I have just one last question for you. So outside of work, what’s something you geek out on just as much as AI and networking?
### Taylor Allison [00:16:32]
Great, great question. Truly, my passion is music. I love to play music, I love to sing. I’ve got a bunch of instruments not far off-camera, and as stress relief, sometimes you just sit down and play the piano or the ukulele. There’s something so powerful and so humanizing about it, you know. We talk about AI, and I already geeked out about how cool it is, but there’s something really special in the things that people create, and listening to and playing music is a uniquely human experience.
### Ilona Gabinsky [00:17:25]
Well, yeah, this is great. Thank you very much, Taylor, for such a great conversation, and thank you for your time. Really appreciate it. And listeners, if you have any questions, please feel free to reach out to us, and stay tuned for our next episode.
### Taylor Allison [00:17:40]
Awesome. Awesome. Thank you so much.
### Ilona Gabinsky [00:17:44]
Thank you, Taylor.