NVIDIA Developer Blog·Hardware·3d ago·by Christian Shrauder·~3 min read

Introducing NVIDIA Fleet Intelligence for Real-Time GPU Fleet Visibility and Optimization

The compute capability of large GPU fleets presents unprecedented opportunities to innovate and deliver value to customers in record time. Yet these advances come with a variety of challenges. At scale, teams juggle heterogeneous hardware, fast-moving software stacks, tight power envelopes, and spiky, multitenant workloads. A single hotspot, misconfigured driver, or subtle hardware fault can ripple outward, causing throttled jobs, missed SLAs, and wasted spend.

In addition, the complexity and number of components in large-scale clusters can be daunting, so it's essential to maintain visibility into day-to-day operations and understand the operational state at any given time. Monitoring GPU utilization and identifying bottlenecks during job execution becomes more difficult at this scale. Identifying areas of low utilization and migrating workloads to them is one of the best ways to ensure the highest return on investment.

For these reasons, GPU-aware monitoring is essential at scale. Teams need visibility beyond whether a node is up. They need to know whether, at any given moment, every accelerator is performing as expected, safely and consistently.

This post introduces NVIDIA Fleet Intelligence, an agent-based managed service for continuous monitoring of NVIDIA data center GPUs. It is now generally available.

What are the key focus areas of GPU monitoring?

Important areas of GPU monitoring include power, temperature, performance, health, and uniform configuration.

- Power: Track power utilization and throttling to stay within data center budgets while maximizing performance per watt.
- Temperature: Detect hotspots and airflow issues early to avoid thermal throttling and premature component aging.
- Performance: Watch utilization, memory bandwidth, interconnect health, and throttling reasons to spot regressions and imbalance across the fleet.
- Health: Surface ECC and XID errors, retired pages, HBM/NVLink/PCIe anomalies, and other RAS signals to catch degrading parts before they fail.
- Uniform configuration and integrity: As part of GPU inventory validation, check for consistent drivers, firmware, and BIOS settings to ensure reproducible results and safe operation, and verify firmware integrity.

What is NVIDIA Fleet Intelligence?

NVIDIA Fleet Intelligence is a low-level, deployment-agnostic managed service that can be used regardless of software stack or scheduler choice. Initially, the service supports data center GPU and CPU customers that manage their own infrastructure, and engineers who need more insight into GPU and CPU behavior. The service draws on technology and IP from across the NVIDIA product portfolio and on lessons learned from running the NVIDIA fleet of hundreds of thousands of GPUs across NVIDIA DGX Cloud.

Fleet Intelligence uses a low-footprint, host-based agent to stream GPU telemetry back to the fully managed Fleet Intelligence cloud service. NVIDIA is releasing the Fleet Intelligence agent as an open source project for auditability. The agent builds on other NVIDIA open source solutions such as GPUd, NVIDIA Data Center GPU Manager (DCGM), and the NVIDIA Attestation SDK. To learn more, visit NVIDIA/fleet-intelligence-agent on GitHub.

Fleet Intelligence has been developed with feedback from early access (EA) customers, including NVIDIA Cloud Partners (NCPs), Lambda, and IREN.

This GA release focuses on three main areas:

- Inventory and visualization…
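To make the uniform-configuration check from the monitoring focus areas more concrete, here is a minimal Python sketch of configuration drift detection across an inventory. It is not the Fleet Intelligence agent or its API. The record layout, the field names (`node`, `driver`, `vbios`), the version strings, and the `find_config_drift` helper are all hypothetical, chosen only for illustration. The idea is to treat the most common value of each field as the expected baseline and flag nodes that deviate from it.

```python
from collections import Counter

# Hypothetical inventory records. Field names and version strings are
# made up for illustration and do not come from any NVIDIA tool.
inventory = [
    {"node": "node-01", "driver": "1.0.0", "vbios": "A1"},
    {"node": "node-02", "driver": "1.0.0", "vbios": "A1"},
    {"node": "node-03", "driver": "1.1.0", "vbios": "A1"},
    {"node": "node-04", "driver": "1.0.0", "vbios": "A1"},
]


def find_config_drift(records, fields):
    """Flag nodes whose value for a field differs from the fleet majority.

    Returns a dict keyed by field name. Fields with no drift are omitted.
    """
    drift = {}
    for field in fields:
        # The most common value is assumed to be the intended baseline.
        counts = Counter(record[field] for record in records)
        expected, _ = counts.most_common(1)[0]
        outliers = {
            record["node"]: record[field]
            for record in records
            if record[field] != expected
        }
        if outliers:
            drift[field] = {"expected": expected, "outliers": outliers}
    return drift


report = find_config_drift(inventory, ["driver", "vbios"])
for field, info in report.items():
    for node, value in info["outliers"].items():
        print(f"{node}: {field} is {value}, fleet baseline is {info['expected']}")
```

Using the majority value as the baseline is a simplification. In practice the expected driver, firmware, and BIOS settings would more likely come from a pinned, reviewed configuration, so that a fleet-wide bad rollout is not mistaken for the norm.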


#gpu

read full article on NVIDIA Developer Blog →
