Research Paper No. 2026-ARCH-04.6

GPU & Accelerator Management

Telemetry, Thermal Management, and BMC Integration for AI Infrastructure

Key figures:

  • H100 TDP: 700W per GPU, 5.6kW per 8-GPU node
  • Thermal margin: 83°C throttling threshold
  • Telemetry interval: <1s required for thermal safety
01.

The GPU Management Challenge

Modern AI clusters are dominated by high-TDP accelerators—NVIDIA H100s at 700W per GPU, with 8-GPU nodes consuming 5.6kW+ of GPU power alone. These thermal densities create unprecedented management challenges where sub-second telemetry is critical for safety, and traditional BMC polling intervals are insufficient.

Thermal Reality

  • H100: 700W TDP, 83°C throttle point
  • 8-GPU node: 5.6kW GPU + 1kW system = 6.6kW
  • Thermal runaway in seconds without cooling
  • Liquid cooling becoming mandatory

Monitoring Requirements

  • Sub-second temperature polling
  • Power consumption per GPU
  • Memory errors (ECC) tracking
  • NVLink/NVSwitch health status
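
The requirements above can be sketched as a watchdog that classifies each temperature sample against the throttle point. The warning margin below is a hypothetical value for illustration; in production the samples would come from NVML/DCGM at sub-second intervals rather than from the caller.

```python
# Minimal thermal-watchdog sketch. The warning margin is illustrative;
# in production, samples would come from NVML/DCGM at sub-second
# intervals rather than being passed in by the caller.

THROTTLE_C = 83.0    # H100 throttling threshold
WARN_MARGIN_C = 8.0  # hypothetical early-warning margin

def classify_sample(temp_c: float) -> str:
    """Map one GPU temperature sample to an action."""
    if temp_c >= THROTTLE_C:
        return "throttle"  # hardware will clock down; shed load now
    if temp_c >= THROTTLE_C - WARN_MARGIN_C:
        return "warn"      # approaching the throttle point
    return "ok"
```

The same classification would feed an alerting pipeline; the point is that the decision logic is trivial once sub-second samples are available.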

Thermal Gradient Analysis [H100 Cluster]

[Figure: time-series samples at a 500ms interval, from T-0ms through peak load to T+500ms]
Time-series observation showing the thermal gradient during a compute burst. A 500ms sampling interval is insufficient for safe thermal management at H100 power levels.
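
One way to quantify why 500ms is too slow: the largest safe polling interval is roughly the remaining thermal margin divided by the worst-case heating rate. The gradient used below is an assumed figure for illustration, not a measured H100 value.

```python
def max_safe_interval_s(temp_c: float, throttle_c: float = 83.0,
                        gradient_c_per_s: float = 20.0) -> float:
    """Upper bound on the polling interval: margin / worst-case ramp.

    gradient_c_per_s is an assumed worst-case heating rate during a
    compute burst, not a measured H100 number.
    """
    margin = max(throttle_c - temp_c, 0.0)
    return margin / gradient_c_per_s

# A GPU running at 75 °C has 8 °C of margin; at a 20 °C/s burst ramp
# it must be sampled at least every 0.4 s — tighter than a 500 ms poller.
```

Under these assumptions, any node already running warm needs sampling well under 500ms, which is the core argument for sub-second telemetry.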

02.

Neocloud Pain Points

GPU-BMC Integration Gap

NVIDIA DCGM (Data Center GPU Manager) operates independently from BMC systems. GPU telemetry is not exposed through standard Redfish APIs. Operators must maintain parallel monitoring stacks—one for server health via BMC, another for GPU health via DCGM/nvidia-smi—creating operational complexity and potential blind spots.
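
Bridging the two stacks means translating DCGM readings into a Redfish-style resource. The sketch below builds a payload shaped like a Redfish `Processors` member; the property names follow Redfish conventions, but the Accelerator schema is still a DMTF draft, so this is an illustration rather than a conformant implementation.

```python
def gpu_to_redfish(gpu_id: int, temp_c: float, power_w: float) -> dict:
    """Translate DCGM-style GPU readings into a Redfish-flavoured
    resource. Property names mimic Redfish conventions; the real
    Accelerator schema is still a DMTF draft, so treat this as
    illustrative, not conformant."""
    return {
        "@odata.id": f"/redfish/v1/Systems/1/Processors/GPU{gpu_id}",
        "Id": f"GPU{gpu_id}",
        "ProcessorType": "GPU",
        "Status": {
            "State": "Enabled",
            "Health": "OK" if temp_c < 83.0 else "Warning",
        },
        "Oem": {  # vendor extension carrying the DCGM readings
            "TemperatureCelsius": temp_c,
            "PowerWatts": power_w,
        },
    }
```

A bridge daemon of this shape lets the BMC-side monitoring stack consume GPU health without a parallel DCGM pipeline, at the cost of maintaining the translation layer until a standard schema lands.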

NVMe Streaming Telemetry

Standardized real-time telemetry for NVMe drives is inconsistent. Different vendors implement proprietary extensions. For AI workloads with checkpoint I/O bursts, predicting drive failures before data loss requires vendor-specific tooling that doesn't integrate with Redfish-based monitoring.

Liquid Cooling Standards

The 1000W+ TDP of next-gen GPUs necessitates liquid cooling. There are no standardized pressure and coolant flow monitoring interfaces. Each cooling vendor implements proprietary sensors and control protocols, complicating unified thermal management.
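
Even without standard monitoring interfaces, the required coolant flow follows from first principles: P = ṁ·c·ΔT. The sketch below sizes flow for the 6.6kW node from Section 01, assuming a water-like coolant and a 10°C loop ΔT (both assumptions, not vendor figures).

```python
def required_flow_lpm(heat_w: float, delta_t_c: float = 10.0,
                      specific_heat_j_per_kg_k: float = 4186.0,
                      density_kg_per_l: float = 1.0) -> float:
    """Coolant flow needed to remove heat_w: m_dot = P / (c * dT).

    Defaults assume a water-like coolant and a 10 C loop delta-T;
    real CDU designs vary, so these numbers are placeholders.
    """
    kg_per_s = heat_w / (specific_heat_j_per_kg_k * delta_t_c)
    return kg_per_s / density_kg_per_l * 60.0  # convert kg/s to L/min

# A 6.6 kW node needs roughly 9.5 L/min under these assumptions.
```

Whatever the vendor-specific sensors report, a unified thermal manager ultimately needs exactly these quantities, flow, ΔT, and heat load, which is why standardized interfaces for them matter.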

GPU Utilization Visibility

Industry average GPU utilization is estimated at only 40-50%. Without integrated telemetry correlating GPU utilization with workload scheduling, operators cannot optimize placement or identify underutilized capacity—a significant cost concern at $30,000+/GPU pricing.
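
The cost of that utilization gap is easy to bound. Using the figures quoted above ($30,000 per GPU, 40-50% utilization), the sketch below estimates the capital left idle per node; it is a coarse bound, not a TCO model.

```python
def stranded_capital_usd(gpus: int, price_per_gpu: float,
                         utilization: float) -> float:
    """Accelerator capital effectively idle at a given average
    utilization. A coarse bound using the figures quoted in the
    text ($30k/GPU, 40-50% utilization), not a TCO model."""
    return gpus * price_per_gpu * (1.0 - utilization)

# An 8-GPU node at 45% average utilization leaves about $132,000
# of accelerator capital idle.
```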

03.

Current GPU Management Ecosystem

NVIDIA Tooling

  • nvidia-smi: CLI for GPU status/control
  • DCGM: Data Center GPU Manager daemon
  • NVML: Low-level management library
  • Fabric Manager: NVSwitch orchestration
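
The simplest way to script against this tooling is nvidia-smi's CSV query mode (`nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits`). The parser below handles that output shape; the sample string stands in for a live invocation.

```python
def parse_query_gpu(csv_text: str) -> list[dict]:
    """Parse the output of
    `nvidia-smi --query-gpu=index,temperature.gpu,power.draw
     --format=csv,noheader,nounits` into per-GPU dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, temp, power = (field.strip() for field in line.split(","))
        rows.append({"index": int(idx),
                     "temp_c": float(temp),
                     "power_w": float(power)})
    return rows

# Sample output standing in for a live nvidia-smi call:
SAMPLE = "0, 64, 412.35\n1, 71, 655.10"
```

This is adequate for ad-hoc scripts; for continuous sub-second collection, NVML or DCGM avoids the per-call process overhead of shelling out to nvidia-smi.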

Integration Approaches

  • Prometheus Exporter: dcgm-exporter for metrics
  • Kubernetes: GPU Device Plugin + NFD
  • Custom Bridge: DCGM → Redfish translation
  • OOB Channel: GPU-to-BMC I2C/SMBus

Integration Challenge: There is no standard mechanism to expose GPU telemetry through Redfish. DMTF has working drafts for Accelerator resources, but implementation varies. OCP is working on specifications to bridge this gap.
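
Until a standard Redfish path exists, most operators scrape dcgm-exporter's Prometheus endpoint. The sketch below pulls per-GPU temperatures out of the text exposition format; the metric name `DCGM_FI_DEV_GPU_TEMP` matches dcgm-exporter's defaults, though label sets vary by deployment.

```python
import re

# Matches lines like: DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="..."} 64
TEMP_LINE = re.compile(
    r'^DCGM_FI_DEV_GPU_TEMP\{[^}]*gpu="(\d+)"[^}]*\}\s+([\d.]+)')

def gpu_temps(exposition: str) -> dict[int, float]:
    """Extract per-GPU temperatures from dcgm-exporter text output.
    Label sets vary by deployment; this assumes a `gpu` label."""
    temps = {}
    for line in exposition.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps

# Sample scrape output standing in for a live endpoint:
SAMPLE = ('DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-abc"} 64\n'
          'DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-def"} 71\n'
          'DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-abc"} 412.35')
```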

04.

Relevant OCP Workstreams

The following OCP projects and sub-projects are actively working on specifications and contributions that address the challenges outlined in this research.

Server - Open Accelerator Infrastructure


Developing specifications for accelerator integration including thermal interfaces, power delivery, and management APIs.

Sub-projects: Open Accelerator Infrastructure, AI HW SW CoDesign

Cooling Environments


Specifications for liquid cooling systems including cold plates, CDUs, and heat reuse—critical for next-gen GPU thermals.

Sub-projects: Immersion, Cold Plate, Coolant Distribution Unit, Heat Reuse

Hardware Management


Working on extending Redfish profiles to include accelerator telemetry and standardized GPU management interfaces.

Sub-projects: Hardware Fault Management, Scalable Cloud Infrastructure Management

05.

OCP Contributions

The following contributions are available through the OCP Contributions portal. These include reference implementations, specifications, and design documents.

GPU & Accelerator RAS v1.0 Final

Specification

Complete RAS requirements for GPU and accelerator hardware including telemetry and error handling.

Contributor: OCP Server Project

GPU Accelerator Mgmt Interfaces v0.9

Specification

Standardized management interfaces for GPU telemetry and control operations.

Contributor: OCP Hardware Management

White Paper: Open Cluster Designs for AI

White Paper

Architecture patterns for AI/GPU clusters including cooling and monitoring considerations.

Contributor: OCP Future Technologies

View all contributions at opencompute.org/contributions