Research Paper No. RES-NEOCLOUD

The Neocloud Hardware Management Stack: A Comprehensive Industry Deep-Dive


Published: April 1, 2025
Read time: 45 minutes (estimated)
Topics: Redfish, OpenBMC, OCP, IPMI, GPU Management, Bare Metal


The single most important finding for neocloud operators is this: the hardware management ecosystem is converging rapidly around three pillars — Redfish as the universal API, OpenBMC as the firmware platform, and OCP as the standards body — yet critical gaps in GPU-specific management, BMC security, and Day-2 automation leave smaller operators building custom tooling from scratch. The tooling gap between hyperscaler internal platforms and what's commercially available represents both the defining challenge and the largest opportunity in the neocloud infrastructure space. This report covers the full stack from low-level firmware protocols through orchestration platforms, maps every major vendor's capabilities and weaknesses, and provides a prioritized action plan for neocloud CTOs deploying GPU clusters at scale.


1. Foundational protocols: from IPMI's frozen legacy to Redfish's rapid evolution

IPMI is dead — but its ghost haunts every data center

The Intelligent Platform Management Interface was created by Intel, HP, NEC, and Dell in September 1998 (v1.0), with v2.0 arriving in February 2004. The last specification update — v2.0 revision 1.1 Errata 7 — shipped on April 21, 2015, and no further development is planned. IPMI remains universally deployed for backward compatibility, providing power control, sensor reading, serial-over-LAN, and system event logs through a binary protocol over UDP port 623.

IPMI's security model is fundamentally broken at the specification level. Cipher Suite 0 bypasses authentication entirely, affecting roughly half of exposed IPMI 2.0 implementations. The RAKP authentication handshake leaks password hashes that can be brute-forced offline — a flaw that cannot be fixed without breaking the specification. Version 1.5 has no encryption whatsoever, and v2.0 requires BMCs to store cleartext passwords for HMAC authentication. Default credentials remain endemic: Dell ships root/calvin, Supermicro uses ADMIN/ADMIN, and IBM defaults to USERID/PASSW0RD. Tools like ipmitool, FreeIPMI, and ipmiutil remain essential for managing legacy hardware, but every new deployment should treat IPMI as a compatibility shim, not a primary interface.
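For legacy fleets, most operators end up wrapping ipmitool in scripts. A minimal Python sketch follows; the host and credentials are placeholders, and commands like this should only ever run on a segmented management network:

```python
import subprocess

def ipmi(host: str, user: str, password: str, *args: str) -> str:
    """Run an ipmitool command against a BMC over the lanplus interface.

    Host and credentials below are placeholders; in practice, pull them
    from a secrets manager and reach BMCs only over a segmented
    management network.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Power state and sensor readings from a legacy node
print(ipmi("10.0.0.10", "admin", "changeme", "chassis", "power", "status"))
print(ipmi("10.0.0.10", "admin", "changeme", "sdr", "list"))
```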

Redfish has become the industry's management API standard

DMTF Redfish, first released in August 2015, has evolved into the definitive hardware management API. The current specification version is DSP0266 v1.23.1 (December 2025), with the latest schema release Redfish 2025.4 shipping January 2026. DMTF has maintained an aggressive quarterly release cadence, delivering four schema releases in 2025 alone.

Redfish replaces IPMI's binary protocol with a RESTful API over HTTPS using JSON payloads and OData v4 semantics. The service root at /redfish/v1/ exposes a hierarchical resource tree: Systems (compute nodes), Chassis (physical enclosures), Managers (BMCs), plus dedicated services for accounts, sessions, events, firmware updates, telemetry, certificates, and tasks. Authentication supports session tokens, Basic Auth, OAuth, multi-factor authentication (added 2022.3), and client certificates. The eventing model provides both Server-Sent Events (SSE) for push notifications and webhook-style event subscriptions.
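A minimal sketch of the session login flow against a hypothetical BMC, using Python's requests; production code should verify TLS certificates rather than disable checking:

```python
import requests

BMC = "https://10.0.0.10"  # hypothetical BMC address

# Create a session: per DSP0266, the token comes back in the X-Auth-Token
# header and the session URI in the Location header.
resp = requests.post(
    f"{BMC}/redfish/v1/SessionService/Sessions",
    json={"UserName": "admin", "Password": "changeme"},
    verify=False,  # lab only; production BMCs should present valid TLS certs
)
resp.raise_for_status()
token = resp.headers["X-Auth-Token"]
session_uri = resp.headers["Location"]

# Walk the service root with the session token
root = requests.get(f"{BMC}/redfish/v1/",
                    headers={"X-Auth-Token": token}, verify=False).json()
print(root["Systems"]["@odata.id"],
      root["Chassis"]["@odata.id"],
      root["Managers"]["@odata.id"])

# Log out by deleting the session resource (Location may be a path or full URL)
logout = session_uri if session_uri.startswith("http") else f"{BMC}{session_uri}"
requests.delete(logout, headers={"X-Auth-Token": token}, verify=False)
```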

Recent schema additions reflect the industry's shift toward AI infrastructure. Redfish 2025.2 introduced VirtualCXLSwitch and VirtualPCI2PCIBridge schemas (co-developed with the CXL Consortium), TelemetryData for bulk device telemetry, and UpdateServiceCapabilities for staged fleet-wide firmware updates developed with OCP input. Redfish 2025.4 added UALink (Ultra Accelerator Link) support in Port and Processor resources. GPU management is modeled through the Processor resource type with ProcessorType=GPU, ProcessorMetrics for utilization and temperature, and the Fabric model for interconnects. SNIA Swordfish (currently v1.2.8, January 2025) extends Redfish for storage management with StorageService, Volume, and FileSystem resources.
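Building on the session token from the sketch above, GPU discovery reduces to walking the Systems tree and filtering Processor resources. The paths follow the standard schema, though resource IDs vary by vendor:

```python
import requests

def get(bmc: str, path: str, token: str) -> dict:
    return requests.get(f"{bmc}{path}",
                        headers={"X-Auth-Token": token}, verify=False).json()

def gpu_inventory(bmc: str, token: str):
    """Yield (id, model, metrics) for Processor resources of type GPU."""
    systems = get(bmc, "/redfish/v1/Systems", token)
    for sys_ref in systems["Members"]:
        procs = get(bmc, sys_ref["@odata.id"] + "/Processors", token)
        for proc_ref in procs["Members"]:
            proc = get(bmc, proc_ref["@odata.id"], token)
            if proc.get("ProcessorType") != "GPU":
                continue
            metrics = {}
            if "Metrics" in proc:  # ProcessorMetrics: utilization, temp, power
                metrics = get(bmc, proc["Metrics"]["@odata.id"], token)
            yield proc["Id"], proc.get("Model"), metrics
```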

OpenBMC is winning the firmware platform war

OpenBMC, a Linux Foundation project born from a 2014 Facebook hackathon, has emerged as the dominant open-source BMC firmware platform. The latest stable release is v2.18.0 (May 2025), backed by 2,847 contributors from 266 organizations with an estimated software value of $1.6 billion. The Technical Steering Committee includes representatives from Google, Meta, IBM, Microsoft, Arm, and Intel, with NVIDIA, AMD, and Ampere as significant contributors.

The architecture is Yocto/OpenEmbedded-based, using D-Bus as the central IPC bus with phosphor-dbus-interfaces defining communication contracts between services. The bmcweb HTTP server provides DMTF-compliant Redfish, WebSocket-based KVM, and serial console access. Vendor customization happens through meta-layers (meta-facebook, meta-google, meta-ibm, meta-intel-openbmc), enabling 80%+ shared codebase across diverse hardware platforms.
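On the BMC itself, those D-Bus objects can be inspected directly with busctl. A small sketch reading the host power state from phosphor-state-manager; the service, object path, and interface names follow the upstream phosphor-dbus-interfaces layout, so verify them against your platform's meta-layer:

```python
import subprocess

def host_power_state() -> str:
    """Read the host power state over D-Bus (runs on the BMC itself)."""
    out = subprocess.run(
        ["busctl", "get-property",
         "xyz.openbmc_project.State.Host",    # D-Bus service
         "/xyz/openbmc_project/state/host0",  # object path for host 0
         "xyz.openbmc_project.State.Host",    # interface
         "CurrentHostState"],
        capture_output=True, text=True, check=True,
    ).stdout
    # busctl prints e.g.: s "xyz.openbmc_project.State.Host.HostState.Running"
    return out.split('"')[1]
```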

OpenBMC is deployed in production at Meta, Google, Microsoft, and IBM, and is rapidly expanding into enterprise. Dell's iDRAC10 is built on OpenBMC foundations. Lenovo's XCC3 (ThinkSystem V4) adopts OpenBMC. AMI announced in October 2025 a unified, SLA-backed OpenBMC codebase with their MegaRAC Community Edition. The market is projected to grow from $1.2 billion (2024) to $4.8 billion by 2033 (16.7% CAGR).

Protocol comparison at a glance

| Dimension | IPMI | Redfish | OpenBMC |
| --- | --- | --- | --- |
| Nature | Wire protocol | API specification + data model | Firmware implementation |
| Latest version | v2.0 Errata 7 (Apr 2015) — frozen | DSP0266 v1.23.1 / Schema 2025.4 | v2.18.0 (May 2025) |
| Transport | UDP port 623, binary | HTTPS, JSON, OData v4 | Implements both IPMI and Redfish |
| Security | Fundamentally broken (Cipher 0, hash leak) | TLS mandatory, OAuth, MFA, SPDM | Inherits Redfish + measured boot, TPM |
| GPU support | None | Processor/GPU resources, PCIeDevice, Fabric | Via Redfish schemas in bmcweb |
| CXL support | None | VirtualCXLSwitch (2025.2), Device Types 1/2/3 | Via Redfish |
| Scalability | Poor (low bandwidth, limited sessions) | Excellent (stateless REST, SSE, aggregation) | Excellent (bmcweb aggregation) |
| Industry trend | Declining, being deprecated | Ascending — de facto standard | Ascending — hyperscaler adoption |

2. OCP standards are defining the AI data center blueprint

The Open Compute Project has become the central standards body for AI-era infrastructure. At the 2024 OCP Global Summit (7,047 attendees, a record), AMD, ARM, and NVIDIA joined the OCP Board, and the Open Data Center for AI strategic initiative launched with backing from Meta, Microsoft, Google, and all major silicon vendors.

Hardware management specifications

OCP's Hardware Management Project has designated Redfish as the out-of-band management interface for all OCP-compliant platforms. Key specifications include OCP Redfish Interoperability Profiles (published on GitHub at HWMgmt-OCP-Profiles), the Hardware Management Module (HMM) for standardized BMC system-on-module designs, and the Firmware Update Specification for cross-platform update requirements.

The Data Center Modular Hardware System (DC-MHS) specifications, approved in November 2022, define interoperable modular building blocks: full-width host processor modules (M-FLW), density-optimized modules (M-DNO), platform infrastructure connectivity (M-PIC), common redundant power supplies (M-CRPS), and extended I/O connectivity (M-XIO). DC-SCM 2.0 standardizes the Secure Control Module — the physical BMC daughter card with a standardized connector. Dell's iDRAC10 follows this DC-MHS/DC-SCM architecture.

Security through Caliptra and Cerberus

Caliptra, the open-source silicon root of trust founded by AMD, Google, Microsoft, and NVIDIA under the CHIPS Alliance, has matured rapidly. Caliptra 2.1 adds quantum-resilient cryptography (NIST module-lattice-based digital signatures). It provides DICE-as-a-Service, SPDM responder signing, and firmware authentication using an embedded RISC-V VeeR core. Both Google and Microsoft have committed to integrating Caliptra in first-party cloud silicon, and AMD has committed for server silicon products.

Project Cerberus (Microsoft) provides platform-level firmware protection, detection, and recovery — intercepting host-to-flash SPI bus accesses for continuous firmware integrity measurement. Together, Caliptra (silicon-internal RoT) and Cerberus (platform-external RoT) form a layered security architecture now backed by the OCP S.A.F.E. conformance program with certified Security Review Providers.

The FTI neocloud workstream

The OCP Future Technologies Initiative's Scaling AI Clusters at Neoclouds workstream, co-led by representatives from Denvr Dataworks, FarmGPU, and Scaleway, is directly addressing neocloud challenges: acquiring power, procuring hardware, open standards for AI infrastructure modularity, and energy efficiency. The workstream has published Ethernet-based training and inference fabric reference architectures with connectivity maps, BOMs, and Kubernetes CRDs for clusters from 64 to 1,024 GPUs.


3. Vendor solutions: the management capabilities that actually matter

Dell iDRAC — the broadest ecosystem with an OpenBMC future

Dell's iDRAC10 (PowerEdge 17th generation) represents a significant architectural shift: built on OpenBMC foundations with a 4-core 1GHz Nuvoton Arbel SoC, 2GB DDR4, and FIPS 140-3 certification. It follows the OCP DC-MHS/DC-SCM architecture and introduces rebootless updates for iDRAC itself, NVMe SSDs, backplanes, PERC controllers, and select GPUs. An integrated Security Enclave manages root-of-trust, BIOS scanning, and device attestation.

Dell's fleet management platform, OpenManage Enterprise (OME) v4.6, scales to 8,000 devices for full lifecycle management or 25,000 devices for monitoring-only. The Redfish API is included in all license tiers, though virtual console, Group Manager (limited to 250 nodes), and telemetry streaming (180+ metrics via SSE) require Enterprise or Datacenter licenses at additional cost. Dell publishes extensive Python and PowerShell Redfish scripting libraries on GitHub, and the dellemc.openmanage Ansible collection (v10.0.1) provides Red Hat certified automation modules.

For GPU servers, Dell's PowerEdge XE series includes the XE9680 (8-GPU, H100/H200/B200), XE9780 (17G air-cooled AI server with iDRAC10), and the XE8712 delivering up to 144 Blackwell GPUs per rack with an Integrated Rack Controller for advanced thermal management.

Key weakness: OME's 8,000-device full management ceiling and 25,000-device monitoring ceiling may be insufficient for large neocloud deployments. There is no cloud-native SaaS management platform equivalent to HPE's GreenLake Compute Ops Management.

HPE iLO — the Redfish gold standard with quantum-ready security

HPE was a founding contributor to the Redfish standard, shipping a proprietary RESTful interface with iLO 4 before the spec was even published. Their implementation is widely regarded as the most standards-compliant in the industry. iLO 7 (ProLiant Compute Gen12, announced February 2025) introduces a dedicated security enclave processor, FIPS 140-3 Level 3 certification, and quantum-resistant LMS firmware signatures — industry firsts for a server management processor.

HPE's developer ecosystem is exceptionally strong: the open-source iLOrest CLI, SDKs in five languages, interactive Jupyter Notebook workshops, and comprehensive API documentation at developer.hpe.com. GreenLake Compute Ops Management provides cloud-native SaaS lifecycle management with sustainability tracking and automated onboarding for distributed locations.

Key weakness: HPE OneView scales to only 740 servers per appliance (Global Dashboard extends to 25 appliances but adds complexity). Per-server licensing for both iLO Advanced and OneView Advanced creates significant cost at scale. The ecosystem spans OneView (on-prem), Compute Ops Management (cloud), and legacy Amplifier Pack, creating management plane fragmentation.

Supermicro — dominant in GPU servers, troubled in BMC quality

Supermicro's hardware portfolio is unmatched for GPU workloads: they're typically first-to-market with NVIDIA GPU platforms, offer the broadest range of form factors, and price 10–30% below Dell/HPE for equivalent specifications. Their A+ GPU server line dominates AI training deployments.

However, Supermicro's BMC firmware quality is a significant liability. Built on ASPEED AST2600 with AMI MegaRAC firmware, their Redfish implementation is measurably less mature than Dell's or HPE's. The check_redfish monitoring tool reports compatibility breakage across BMC firmware versions. Red Hat documented provisioning failures with Supermicro X12 via Redfish. Full Redfish functionality requires the SFT-DCMS-SINGLE license per node — an additional cost that Dell doesn't impose for basic Redfish access.

The security track record is alarming. In October 2023, Binarly disclosed seven CVEs (including CVE-2023-40289 at CVSS 9.1 for root access via command injection), with 70,000+ internet-exposed Supermicro IPMI interfaces discovered. In July 2024, NVIDIA's Offensive Security Research Team found CVE-2024-36435 (CVSS 9.8) — an unauthenticated remote code execution vulnerability affecting X11 through B13 motherboards. Virtual media vulnerabilities exposing plaintext authentication have also been documented.

Lenovo XClarity — quietly adopting OpenBMC

Lenovo's XCC3 (ThinkSystem V4) represents the most significant enterprise vendor shift to OpenBMC on ASPEED AST2600. Redfish Schema Bundle 2024.3 compliance and published Python/PowerShell scripting libraries on GitHub make it a strong standards-compliant option. XClarity Administrator (LXCA) v4.3 provides fleet management for up to 300 devices per appliance with agent-free hardware management, centralized firmware compliance, and bare-metal OS provisioning.

NVIDIA — the most complex management stack in data centers

NVIDIA DGX systems use OpenBMC-based firmware with a unique dual-BMC architecture: a Host BMC managing the CPU tray and an HGX BMC (HMC) managing the GPU tray. The HGX BMC exposes NVSwitch data via Redfish at /redfish/v1/Fabrics/HGX_NVLinkFabric_0/Switches/, GPU health metrics via EnvironmentMetrics URIs, and power management through a Node Manager API with power domains and policies.

DCGM (Data Center GPU Manager) is the critical GPU monitoring layer, providing metrics through a modular daemon architecture (nv-hostengine) with specialized modules for NVSwitch, health, diagnostics, and profiling. DCGM Exporter exposes 80+ GPU metrics to Prometheus on port 9400. Key metrics include SM clock frequency, GPU/memory utilization, temperature, power draw, ECC errors (SRAM/DRAM correctable and uncorrectable), XID error codes, NVLink bandwidth counters, and row remapping status. DCGM sits on top of NVML and adds group management, policy enforcement, active diagnostics (4 levels from quick deployment check to extended stress tests), and profiling metrics (SM activity, Tensor Core utilization) unavailable through nvidia-smi alone.
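A minimal consumer of the exporter's endpoint, assuming DCGM Exporter is running locally on its default port 9400 and using the prometheus_client text parser:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# DCGM Exporter serves Prometheus text format on :9400/metrics by default
text = requests.get("http://localhost:9400/metrics", timeout=5).text

WATCH = {"DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_XID_ERRORS",
         "DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS"}

for family in text_string_to_metric_families(text):
    if family.name not in WATCH:
        continue
    for sample in family.samples:
        gpu = sample.labels.get("gpu", "?")  # exporter labels samples per GPU
        print(f"{family.name} gpu={gpu} value={sample.value}")
```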

Fabric Manager is the essential daemon for NVSwitch-based systems. It configures NVLink memory fabrics, manages GPU routing and NVLink port mapping, and continuously monitors fabric health. For 4th-generation NVSwitch (DGX B200/B300, GB200 NVL72), the NVLink Subnet Manager (NVLSM) handles topology discovery and forwarding table programming. The GB200 NVL72's 72-GPU NVLink domain is managed through NMX (NVLink Management Software): NMX-Controller for global fabric management via gRPC, and NMX-Telemetry for centralized monitoring with Prometheus-compatible REST endpoints.

BlueField DPUs (BlueField-2/3) add another management layer with their own integrated BMC (supporting full Redfish API), the rshim host-side interface for firmware provisioning, and DOCA SDK for programming. Mode switching between DPU, NIC, and zero-trust modes is controlled via Redfish BIOS settings on the DPU's BMC.

Base Command Manager (BCM), formerly Bright Cluster Manager, provides cluster-level provisioning, monitoring, and workload management with Slurm and Kubernetes integration. BCM offers a free license for up to 8 accelerators per system at any cluster size, with enterprise support purchased separately.

ODM/OEM vendors and the AMI MegaRAC crisis

Taiwan-based ODMs dominate GPU server manufacturing: Foxconn (~24% share), Inventec (~22%), Quanta/QCT (~15%), and Wiwynn (~1% but growing rapidly with 50%+ revenue from Meta). Nearly all use ASPEED AST2600 BMC SoCs with AMI MegaRAC firmware — the same firmware at the center of a systematic security crisis.

AMI MegaRAC vulnerabilities represent the most urgent security issue in data center hardware. The timeline is devastating: CVE-2022-40259 (CVSS 9.9, remote code execution via Redfish), CVE-2023-34329 (authentication bypass via HTTP header spoofing), and the crown jewel — CVE-2024-54085 (CVSS 10.0), an authentication bypass affecting products from HPE, ASUS, ASRock, Lenovo, NVIDIA, Supermicro, and others. This vulnerability was added to CISA's Known Exploited Vulnerabilities catalog in June 2025 — the first BMC vulnerability ever to reach that list, confirming active exploitation in the wild. Censys observed 4,110+ exposed MegaRAC instances on the public internet.

Vendor comparison matrix

| Capability | Dell iDRAC10 | HPE iLO 7 | Supermicro | Lenovo XCC3 | NVIDIA DGX |
| --- | --- | --- | --- | --- | --- |
| BMC platform | OpenBMC (Nuvoton Arbel) | Proprietary (secure enclave) | AMI MegaRAC (AST2600) | OpenBMC (AST2600) | OpenBMC (custom) |
| Redfish maturity | Strong + extensive OEM | Industry best | Inconsistent across FW versions | Good, improving | Good for power/lifecycle |
| Fleet scale | OME: 8K/25K | OneView: 740/appliance; COM: SaaS | SSM: 10K (Linux) | LXCA: 300/appliance | BCM: unlimited |
| GPU telemetry | Via Redfish (basic) | Via Redfish (basic) | Requires DCGM/nvidia-smi | Via Redfish (basic) | DCGM native (80+ metrics) |
| Security posture | FIPS 140-3, SHA-384/512 | FIPS 140-3 Level 3, quantum-resistant | History of critical CVEs | OpenBMC audit capability | ERoT per component |
| Licensing burden | Enterprise/Datacenter per-server | iLO Advanced + OneView per-server | SFT-DCMS per node for Redfish | Standard/Premier + LXCA | BCM free up to 8 GPUs |
| Neocloud fit | Good (scale, ecosystem) | Good (Redfish, security) | Common (price, GPU variety) | Moderate | Best for DGX SuperPOD |

4. Bare metal provisioning for the GPU era

The provisioning platform landscape

MAAS (Metal as a Service) by Canonical, currently at v3.7.2, provides the most mature open-source bare metal lifecycle management. Its two-tier architecture (Region Controller for central management, Rack Controller for per-segment DHCP/TFTP/PXE) scales to ~1,000 machines per rack controller with PostgreSQL-backed state management. MAAS 3.7 notably added NVIDIA BlueField-3 DPU support, treating DPUs as first-class managed devices. The full node lifecycle — New → Commissioning → Ready → Allocated → Deployed → Released — integrates with Juju, Ansible, Terraform, and Kubernetes.

OpenStack Ironic (current release: 2025.1 "Epoxy") offers the most flexible driver architecture, supporting IPMI, Redfish, iDRAC, iLO, and vendor-specific interfaces through pluggable driver modules. The 2025.1 release introduced a bootc deploy interface for OCI container images and native container registry support for deployment artifacts. Ironic runs standalone (without full OpenStack) via Bifrost or Metal3 (CNCF Incubated, Kubernetes-native). Redfish Virtual Media boot eliminates PXE/TFTP infrastructure entirely — the BMC fetches a boot ISO via HTTP and mounts it as a virtual CD/DVD drive.

Tinkerbell (CNCF Sandbox) provides a Kubernetes-native workflow engine: Smee handles DHCP/iPXE, Tootles serves EC2-like metadata, HookOS boots an in-memory LinuxKit environment, and Tink Server orchestrates workflow actions delivered as container images. Rufio provides BMC interaction through Kubernetes CRDs using the bmclib abstraction library.

Warewulf 4 (v4.6.0, March 2025) dominates HPC stateless/diskless provisioning, now supporting OCI container images as node OS definitions — enabling CI/CD pipelines for infrastructure images. Created by Greg Kurtzer (also behind Rocky Linux), Warewulf scales to tens of thousands of nodes.

The boot protocol evolution

The industry is migrating from PXE/TFTP (slow, UDP-based, no encryption) through iPXE (HTTP/HTTPS with scripting) toward UEFI HTTP Boot (native firmware HTTP support without chainloading) and ultimately Redfish Virtual Media (BMC-initiated HTTP fetch, zero network boot infrastructure required). Each generation eliminates infrastructure complexity while improving security and reliability.
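A sketch of the Redfish Virtual Media flow, with hypothetical manager and system IDs (real IDs vary by vendor; iDRAC, for example, uses iDRAC.Embedded.1):

```python
import requests

BMC, TOKEN = "https://10.0.0.10", "..."  # hypothetical; see session example above
H = {"X-Auth-Token": TOKEN}

# 1. Mount a boot ISO: the BMC fetches it over HTTP as a virtual CD
requests.post(
    f"{BMC}/redfish/v1/Managers/1/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia",
    json={"Image": "http://imagehost.example/node-install.iso", "Inserted": True},
    headers=H, verify=False,
).raise_for_status()

# 2. One-shot boot override to the virtual CD, then power-cycle
requests.patch(
    f"{BMC}/redfish/v1/Systems/1",
    json={"Boot": {"BootSourceOverrideTarget": "Cd",
                   "BootSourceOverrideEnabled": "Once"}},
    headers=H, verify=False,
).raise_for_status()
requests.post(
    f"{BMC}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
    json={"ResetType": "ForceRestart"}, headers=H, verify=False,
).raise_for_status()
```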

GPU cluster provisioning challenges

Provisioning GPU nodes is fundamentally harder than commodity servers. OS images with NVIDIA drivers, CUDA toolkit, cuDNN, NCCL, and ML frameworks can exceed 20–50GB. Multi-component firmware alignment is critical — a DGX H100 contains dozens of firmware components (SBIOS, BMC, VBIOS per GPU, NVSwitch firmware, ERoTs, FPGAs, CPLDs, PCIe retimers and switches, PSU firmware, NVMe firmware, NIC firmware), each with dependency chains and specific update tools. Post-install validation requires GPU health checks, ECC memory verification, and NVLink topology confirmation.
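A post-install validation sketch built on standard nvidia-smi query fields; the expected GPU count and the temperature ceiling are deployment-specific assumptions:

```python
import subprocess

def validate_gpus(expected_count: int) -> list[str]:
    """Post-install sanity checks: GPU count, temperature, uncorrectable ECC."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,temperature.gpu,"
         "ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()

    problems = []
    if len(out) != expected_count:
        problems.append(f"expected {expected_count} GPUs, found {len(out)}")
    for line in out:
        idx, name, temp, ecc = [f.strip() for f in line.split(",")]
        if ecc not in ("0", "[N/A]"):
            problems.append(f"GPU {idx} ({name}): {ecc} uncorrectable ECC errors")
        if temp.isdigit() and int(temp) > 85:  # 85C ceiling is an assumption
            problems.append(f"GPU {idx} ({name}): idle temperature {temp}C")
    return problems
```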

Revenue impact is substantial: a $10M GPU cluster losing days of provisioning time represents millions in lost revenue at current GPU-hour pricing. Solutions include image-based deployment (Warewulf containers), HTTP boot with local image caching, parallel provisioning across racks, and pre-staging firmware bundles during commissioning.


5. Firmware updates: the multi-vendor coordination nightmare

Firmware management at scale is where the abstraction layers break down. LVFS (Linux Vendor Firmware Service) and fwupd provide a standards-based update mechanism — the daemon discovers hardware, checks LVFS for updates, and applies them via UEFI Capsule or vendor-specific protocols. Nearly every major vendor participates (Dell, Lenovo, HP, Intel), with millions of devices updated monthly. However, server adoption lags consumer hardware, and Redfish plugin support varies.

NVIDIA provides dedicated tools: nvfwupd for DGX/HGX systems (supporting Redfish API and host CLI, capable of parallel GPU tray updates), nvflash for individual GPU VBIOS updates, and mlxup/mlxfwmanager for ConnectX NIC firmware. Each NIC has a unique PSID identifying the OEM variant, and firmware must match — Dell, Lenovo, and HPE variants of the same ConnectX adapter require different firmware images.

The OCP GPU Firmware Update Specification documents the core challenge: GPU firmware staging takes approximately 15 minutes, plus activation time. Currently, staging and activation must happen "back-to-back on a relatively short time scale," meaning customer downtime covers the entire duration. Hyperscalers can stage firmware during operation and perform gang activation later, but this risks accidental early activation on rebooted nodes. The specification requires firmware update paths without intermediate versions (minimum: supported path from any version released within 6 months) and explicit ERoT/iRoT rollback policies that create tension between security (preventing firmware downgrades) and operational needs (rollback on failure).
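A hedged sketch of the staged-update pattern using the standard SimpleUpdate action; the image URI is hypothetical, and the OnReset apply time only works where the BMC's UpdateService advertises support for it:

```python
import requests

BMC, TOKEN = "https://10.0.0.10", "..."  # hypothetical
H = {"X-Auth-Token": TOKEN}

# Check what this UpdateService advertises before staging anything
svc = requests.get(f"{BMC}/redfish/v1/UpdateService", headers=H, verify=False).json()

# Stage the image now, activate on next reset: the decoupled pattern the
# OCP specification describes, where the implementation supports it.
resp = requests.post(
    f"{BMC}/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate",
    json={
        "ImageURI": "http://fw.example/gpu-tray-bundle.fwpkg",  # hypothetical
        "@Redfish.OperationApplyTime": "OnReset",
    },
    headers=H, verify=False,
)
resp.raise_for_status()
# Long-running updates return a Task; poll it until staging completes
print("firmware staging task:", resp.headers.get("Location"))
```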

Network switch firmware follows a separate path entirely: ONIE (Open Network Install Environment) provides standardized NOS installation for white-box switches, supporting SONiC, Cumulus Linux, and other operating systems via DHCP-triggered HTTP/TFTP delivery.


6. Monitoring and telemetry: building observability for GPU infrastructure

The hardware telemetry stack

Redfish Telemetry Service (introduced in v1.6.0/2018.2) provides MetricReports with configurable collection schedules (Periodic, OnChange, OnRequest) and dual delivery modes — pull via GET or push via SSE/EventService subscriptions. The 2025.2 release added TelemetryData for bulk device telemetry retrieval without BMC decode overhead. OCP's 2025 Global Summit featured streaming telemetry work introducing TimeSeriesRecord schemas with time-label-value records for high-frequency streaming.
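Pulling a MetricReport is a plain GET; a minimal sketch against a hypothetical BMC:

```python
import requests

BMC, TOKEN = "https://10.0.0.10", "..."  # hypothetical
H = {"X-Auth-Token": TOKEN}

reports = requests.get(f"{BMC}/redfish/v1/TelemetryService/MetricReports",
                       headers=H, verify=False).json()
for ref in reports.get("Members", []):
    report = requests.get(f"{BMC}{ref['@odata.id']}",
                          headers=H, verify=False).json()
    for mv in report.get("MetricValues", []):
        # Each entry carries the metric ID, reading, and timestamp
        print(report["Id"], mv.get("MetricId"),
              mv.get("MetricValue"), mv.get("Timestamp"))
```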

GPU monitoring requires DCGM Exporter (latest: 4.5.2-4.8.1), a Go binary wrapping the DCGM C API that exposes /metrics on port 9400 for Prometheus scraping. Key metrics include DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, and NVLink bandwidth counters. The official Grafana dashboard (ID 12239) provides standard visualization. For AMD GPUs, amd-smi (v26.2.2) provides equivalent capabilities with the AMD Device Metrics Exporter for Prometheus integration.

Understanding GPU error taxonomy is essential for neocloud operators. XID errors are NVIDIA driver error reports in kernel logs, ranging from informational (Xid 94: contained correctable ECC) through warning (Xid 13: application fault) to fatal (Xid 79: "GPU has fallen off the bus"). ECC errors span SRAM (on-chip cache) and DRAM (HBM) in both correctable and uncorrectable categories — more than 4 aggregate SRAM uncorrectable errors typically warrant an RMA. SXid errors are NVSwitch-specific; a fatal SXid causes Fabric Manager to abort all running CUDA jobs. Row remapping (Ampere+) automatically handles failing memory rows, replacing the older Dynamic Page Retirement mechanism.
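A small classifier sketch over kernel-log Xid lines, seeded with the severity buckets above; extend the table from NVIDIA's Xid documentation for the codes your fleet actually emits:

```python
import re

# Severity buckets taken from the taxonomy above (illustrative subset)
XID_SEVERITY = {94: "informational", 13: "warning", 79: "fatal"}

XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def classify_xids(kernel_log: str):
    """Yield (pci_address, xid, severity) for each Xid line in a kernel log."""
    for match in XID_RE.finditer(kernel_log):
        pci, xid = match.group(1), int(match.group(2))
        yield pci, xid, XID_SEVERITY.get(xid, "unknown")

log = "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
print(list(classify_xids(log)))  # [('PCI:0000:3b:00', 79, 'fatal')]
```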

Choosing the right observability backend

Standard Prometheus handles approximately 100,000 active time series per 4 vCPU shard. A 1,000-node GPU cluster with 8 GPUs each generates 640,000+ GPU metric series alone (80 metrics × 8,000 GPUs), requiring 6+ Prometheus shards before counting node-level metrics. VictoriaMetrics handles up to 100 million active time series on a single node with 10x better data compression and 5x lower memory consumption — users like CERN's CMS collaboration and Grammarly report 10x cost reductions. The recommended pattern: Prometheus scrapes, VictoriaMetrics stores.
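The sharding arithmetic is worth making explicit; a quick back-of-envelope check using the figures above:

```python
GPUS_PER_NODE, NODES, METRICS_PER_GPU = 8, 1000, 80
SERIES_PER_PROM_SHARD = 100_000  # the sizing figure used above

gpu_series = GPUS_PER_NODE * NODES * METRICS_PER_GPU
shards = -(-gpu_series // SERIES_PER_PROM_SHARD)  # ceiling division
print(f"{gpu_series:,} GPU series -> {shards} Prometheus shards "
      "(before node-level metrics)")
# 640,000 GPU series -> 7 Prometheus shards (before node-level metrics)
```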

For long-term retention and multi-cluster views, Thanos (CNCF Incubating) extends Prometheus with object storage backends and global query federation. Grafana Mimir provides horizontally scalable ingestion with sharded query engines for high-cardinality workloads. The Elastic Stack handles log aggregation for BMC logs, kernel hardware errors, and SEL events — essential for correlating XID error sequences across a fleet.

DCIM as the infrastructure source of truth

NetBox (now v4.x, maintained by NetBox Labs) has become the de facto open-source DCIM/IPAM, modeling regions, sites, racks, devices, interfaces, cables, IP prefixes, VLANs, and circuits with a full REST API and GraphQL. Nautobot (forked from NetBox v2.10.4 by Network to Code) adds native Git integration, a built-in workflow engine, and MySQL support. For neocloud operations, DCIM integration with provisioning automation (Ansible/Terraform driven from NetBox API) enables rack-to-production workflows, power budget tracking, and capacity planning.
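A sketch of the rack-to-production starting point, pulling GPU nodes from NetBox's REST API; the instance URL, API token, and the gpu-node role slug are hypothetical:

```python
import requests

NETBOX = "https://netbox.example.com"    # hypothetical instance
H = {"Authorization": "Token 0123abcd"}  # hypothetical API token

# Pull every device with a hypothetical 'gpu-node' role to seed provisioning
devices, url = [], f"{NETBOX}/api/dcim/devices/?role=gpu-node&limit=200"
while url:
    page = requests.get(url, headers=H).json()
    devices.extend(page["results"])
    url = page["next"]  # NetBox paginates; follow until exhausted

for dev in devices:
    rack = dev["rack"]["name"] if dev["rack"] else "-"
    print(dev["name"], dev["site"]["name"], rack)
```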

Automated fault remediation

NVIDIA NVSentinel (v1.0.0, Apache 2.0) provides production-ready automated fault remediation for Kubernetes GPU clusters. Its microservice architecture includes GPU Health Monitor (DCGM integration), Syslog Health Monitor, and a Health Events Analyzer that classifies severity and triggers automated cordon → drain → diagnostics → remediation workflows. The system reportedly reduces detection-to-remediation from hours to seconds. The NVIDIA Xid Catalog provides machine-readable fault resolution buckets mapping each error code to specific actions: RESET_GPU, REBOOT_NODE, DRAIN_AND_INSPECT, or RMA.
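The cordon step of such a workflow is straightforward with the official Kubernetes Python client; a sketch of that one step, not NVSentinel's actual implementation:

```python
from kubernetes import client, config

def cordon_and_label(node_name: str, xid: int) -> None:
    """First step of a cordon -> drain -> diagnose workflow."""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    v1.patch_node(node_name, {
        "spec": {"unschedulable": True},  # cordon: no new pods land here
        "metadata": {"labels": {"gpu-health/last-xid": str(xid)}},
    })

cordon_and_label("gpu-node-017", 79)  # fatal Xid: take the node out of rotation
```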


7. Orchestration: from job schedulers to infrastructure-as-code

GPU cluster workload management

Slurm (v25.11) remains the dominant workload manager for AI training clusters. Its GPU scheduling through GRES (Generic Resources) provides topology-aware placement, while the PowerSave plugin enables automatic power management of idle nodes via configurable suspend/resume programs that can invoke IPMI or Redfish commands. Version 25.11 introduces Hierarchical Resources (including power-capping Mode 3), Expedited Requeue for automatic job recovery on node failure, and native Prometheus metrics endpoints.
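A sketch of a Redfish-backed SuspendProgram: Slurm passes a hostlist expression as the first argument, and the node-to-BMC naming convention here is a local assumption:

```python
#!/usr/bin/env python3
"""Sketch of a Slurm SuspendProgram that powers idle nodes off over Redfish."""
import subprocess
import sys

import requests

def expand(hostlist: str) -> list[str]:
    # scontrol expands e.g. "gpu[001-004]" to one hostname per line
    out = subprocess.run(["scontrol", "show", "hostnames", hostlist],
                         capture_output=True, text=True, check=True).stdout
    return out.split()

for node in expand(sys.argv[1]):
    # BMC naming (node + '-bmc') is a site-local assumption
    requests.post(
        f"https://{node}-bmc/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
        json={"ResetType": "GracefulShutdown"},
        headers={"X-Auth-Token": "..."},  # fetch from a secrets store in practice
        verify=False,
    ).raise_for_status()
```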

The Kubernetes + NVIDIA GPU Operator stack deploys GPU device plugins, driver containers, DCGM Exporter, MIG Manager, and GPU Feature Discovery as a unified Helm chart. While powerful for inference and microservices, Kubernetes faces limitations for large-scale distributed training where Slurm's tighter integration with MPI and NCCL provides lower overhead.

Infrastructure-as-code for bare metal

The Dell Terraform Redfish Provider (v1.6.0) enables declarative management of server power cycles, BIOS attributes, iDRAC settings, storage volumes, virtual media, firmware updates, and Server Configuration Profile export/import. The golden configuration pattern — export SCP from a reference server, import to fleet members — provides reproducible server configuration at scale.
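In Python, the export half of that pattern looks roughly like this; the OEM action path matches Dell's published Redfish scripting examples, but verify it against your iDRAC release:

```python
import requests

IDRAC = "https://idrac-ref-node.example"  # hypothetical reference server
AUTH = ("root", "calvin")  # default credentials shown for illustration only

# Kick off an SCP export; iDRAC returns a Task to poll for the JSON profile
resp = requests.post(
    f"{IDRAC}/redfish/v1/Managers/iDRAC.Embedded.1/Actions/Oem/"
    "EID_674_Manager.ExportSystemConfiguration",
    json={"ExportFormat": "JSON", "ShareParameters": {"Target": "ALL"}},
    auth=AUTH, verify=False,
)
resp.raise_for_status()
print("poll for the golden profile at:", resp.headers["Location"])
```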

Ansible provides the broadest BMC automation coverage: community.general modules for vendor-neutral Redfish operations (redfish_info, redfish_command, redfish_config), the Dell dellemc.openmanage collection (v10.0.1) for iDRAC-specific automation, and HPE's hpe.ilo and hpe.oneview collections. A typical bare metal workflow chains hardware configuration (Terraform/Redfish) → OS provisioning (Ironic/MAAS) → post-install configuration (Ansible) → workload orchestration (Slurm/K8s).


8. Critical gaps and pain points, prioritized for neocloud operators

Tier 1: Existential risks

BMC security is the most urgent crisis. AMI MegaRAC CVE-2024-54085 (CVSS 10.0, confirmed active exploitation, CISA KEV catalog) enables unauthenticated remote server takeover, malware deployment, firmware tampering, and physical damage across products from a dozen major vendors. Supermicro has 70,000+ internet-exposed IPMI interfaces with a documented history of critical vulnerabilities. NSA issued dedicated BMC hardening guidance in June 2023. Every neocloud must immediately audit BMC network exposure, enforce credential rotation, segment management networks, and establish firmware patching SLAs with vendors.

Firmware update complexity at scale is the second existential risk. A single DGX H100 node contains dozens of firmware components with dependency chains, mandatory staging times (~15 minutes per GPU firmware update), version-specific upgrade paths, and ERoT rollback policies that conflict with operational recovery needs. Dell community forums document scenarios where BMC firmware corruption creates update deadlocks requiring board replacement. Coordinating multi-vendor firmware (BIOS + BMC + NIC + GPU + switch) across hundreds of nodes without production disruption requires custom orchestration that most neoclouds must build from scratch.

Tier 2: Operational blockers

The neocloud tooling gap is the defining operational challenge. Hyperscalers employ thousands of SREs and build custom management platforms internally. Enterprise tools (Dell OME at 8K nodes, HPE OneView at 740/appliance) were designed for heterogeneous enterprise environments, not cloud-scale homogeneous GPU fleets. There is no commercially available equivalent to hyperscaler internal platforms. McKinsey (2025) notes that "neoclouds originally emerged as stopgaps to address the GPU shortage, but their BMaaS economics are fragile" — average GPU cluster utilization sits at approximately 40%, with 60 cents of every dollar wasted on idle time.

GPU management standardization is years behind deployment reality. OCP's GPU & Accelerator Management Interfaces specification (v0.9/v1.0/v1.1) defines PLDM over MCTP for BMC-to-GPU communication — not Redfish directly. A question posted to the Redfish Specification Forum asks "how to model the NVLink interconnection between GPUs" for GB200 systems, confirming this gap remains unresolved. NVLink Switch management on GB200 "MUST be done via NVOS" — bypassing Redfish entirely. Each GPU vendor (NVIDIA DCGM, AMD ROCm SMI) uses proprietary telemetry with different metric names, collection methods, and error taxonomies.

Redfish adoption inconsistency across vendors defeats the standard's interoperability promise. Dell uses proprietary OEM extensions for boot order management and Server Configuration Profiles. HPE uses custom CHIF interfaces alongside standard Redfish. Supermicro's AMI MegaRAC firmware reports inventory under Oem.Ami.FirmwareInventory. NVIDIA DGX BMCs follow DSP0266 v1.7.0 with Redfish Schema 2019.1 — six years behind current specifications. The OpenStack Ironic Sushy library explicitly documents the need for vendor-specific code paths for OEM extensions, and the fwupd project requires per-vendor BMC reset behavior flags.

Tier 3: Scale and efficiency challenges

BMC polling limitations derive from fundamental hardware constraints. The ASPEED AST2600 (dual-core ARM Cortex-A7 at 1.2GHz with typically 512MB–1GB RAM) runs a full HTTPS/TLS web server. NVIDIA DGX release notes document BMC usage spikes causing POST failures. HTTPS/TLS termination on these constrained SoCs limits concurrent connections to single digits. The Redfish push model (SSE/event subscriptions) helps but requires BMCs to maintain subscription state.

Observability at hyperscale faces cardinality explosion: 1,000 nodes × 8 GPUs × 80+ metrics generates 640,000+ time series for GPU metrics alone, before accounting for per-NVLink-lane, per-HBM-stack, and node-level metrics. Prometheus requires 6+ shards. VictoriaMetrics or Thanos is effectively mandatory, adding architectural complexity and storage costs.

Day-2 operations automation maturity is the final major gap. Most tooling handles Day-0 (procurement) and Day-1 (deployment) reasonably well. Day-2 (operate, maintain, update, troubleshoot) remains largely manual. Hardware failures, driver incompatibilities, and configuration drift are expected operational conditions at scale — Canonical documents these as "not exceptional events but expected operational conditions." NVIDIA NVSentinel and AMD GPU Operator's Argo Workflows represent early automated remediation, but comprehensive Day-2 automation for heterogeneous GPU fleets doesn't exist as a product.


9. Emerging trends reshaping the next three years

CXL will transform memory management

CXL 3.1 (November 2023) introduced port-based routing for any-to-any fabric communication and Global Integrated Memory for host-to-host communication. CXL 3.2 (December 2024) optimizes memory device monitoring. CXL 4.0 doubles bandwidth to 128 GT/s. Redfish 2025.2's VirtualCXLSwitch schema, co-developed with the CXL Consortium, begins standardizing management of CXL fabrics that can span 4,096 nodes. CXL enables memory pooling (eliminating stranded memory), composable disaggregated infrastructure, and multi-tenant memory sharing with the Trusted Security Protocol. This fundamentally changes how neoclouds can allocate and share resources.

Post-quantum security is arriving in BMCs

HPE iLO 7 already ships with quantum-resistant LMS firmware signatures. Caliptra 2.1 adds NIST MLDSA-87 post-quantum signature verification. SPDM 1.3.1 provides device-to-device authentication that integrates with Redfish SecurityMode (2025.1) and automatic certificate enrollment via SCEP/ACME (2025.3). The full attestation chain — from silicon root of trust through firmware integrity to runtime measurement — is becoming production-ready across the ecosystem.

AI-driven predictive failure is production-proven

Meta's Llama 3 training suffered 466 job interruptions in 54 days, with 78% caused by hardware issues. HPE InfoSight claims to predict and resolve 86% of issues before impact by analyzing trillions of data points from 100,000+ systems. Dell CloudIQ provides free SaaS-based anomaly detection for PowerEdge servers. NVIDIA launched a GPU Telemetry Software service in 2025 with an open-source client agent. For neoclouds, implementing hardware risk scoring — weighted sums of ECC error rates, thermal throttle frequency, firmware drift, and allocation retry counts — provides actionable failure prediction without hyperscaler-scale ML infrastructure.
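A sketch of such a risk score; the weights and normalization ceilings are illustrative and need tuning against local failure history:

```python
def hardware_risk_score(ecc_rate: float, throttle_freq: float,
                        fw_drift: int, alloc_retries: int) -> float:
    """Weighted hardware risk score in [0, 1]; weights are illustrative."""
    WEIGHTS = {"ecc": 0.4, "throttle": 0.25, "drift": 0.15, "retries": 0.2}
    # Normalize each signal against a fleet-specific 'alarming' ceiling
    signals = {
        "ecc": min(ecc_rate / 4.0, 1.0),             # SRAM uncorrectable / week
        "throttle": min(throttle_freq / 10.0, 1.0),  # thermal events / day
        "drift": min(fw_drift / 3.0, 1.0),           # components behind baseline
        "retries": min(alloc_retries / 5.0, 1.0),    # allocation retries / day
    }
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

# A score above ~0.7 might trigger a proactive drain before the next job lands
print(round(hardware_risk_score(2, 4, 1, 0), 2))
```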

Liquid cooling management is being standardized

With AI racks reaching 100–150 kW (and Google discussing 1 MW IT racks), liquid cooling is becoming mandatory. OCP's Advanced Cooling Solutions workstream covers cold plate, immersion, CDU, and door heat exchangers. Google's Project Deschutes CDU design achieved ~99.999% fleet availability since 2020. Redfish is adding liquid cooling management: CDU controls in 2024.4, thermal equipment subsystem messages in 2025.2, and valve/leak detector messages in 2025.3. First vendor implementations shipped in early 2025.

OpenBMC consolidation is accelerating

The market is converging on OpenBMC as the common firmware foundation. Dell (iDRAC10), Lenovo (XCC3), NVIDIA (DGX), and all major hyperscalers have adopted it. AMI is pivoting from proprietary MegaRAC to providing enterprise support and SLAs on top of OpenBMC, with their MegaRAC Community Edition as the first OCP S.A.F.E.-compliant OpenBMC distribution. The OpenBMC market is projected to reach $4.8 billion by 2033. For neoclouds, this consolidation means a more uniform management interface across vendors and the ability to contribute upstream fixes that benefit the entire fleet.


Conclusion: a strategic playbook for neocloud hardware management

The hardware management landscape is at an inflection point. Three concurrent transitions — IPMI to Redfish, proprietary BMC firmware to OpenBMC, and enterprise tooling to cloud-native automation — create both risk and opportunity for neocloud operators. The operators who build robust, automated management stacks today will achieve the GPU utilization rates (targeting >80%, up from the current ~40% industry average) that determine economic viability.

Immediate priorities should be BMC security hardening (network segmentation, credential management, firmware patching SLAs), standardized Redfish-based automation using Ansible and Terraform with vendor-specific collections, and DCGM-based GPU health monitoring with automated fault remediation through NVSentinel or custom tooling. Medium-term investments should target unified firmware lifecycle management (declarative firmware-as-code with compliance tracking), VictoriaMetrics or Thanos for observability at scale, and NetBox-driven DCIM as the infrastructure source of truth. Strategic bets include engaging with OCP's Scaling AI Clusters at Neoclouds workstream, preparing infrastructure for CXL memory pooling, and contributing to OpenBMC to ensure your hardware variants are well-supported upstream.

The most important insight from this research: the gap between what hyperscalers have built internally and what's available commercially is the single largest competitive moat in cloud infrastructure. The neoclouds that close this gap — through internal tooling, open-source contribution, or strategic vendor partnerships — will survive the coming market consolidation. Those that don't will find their unit economics crushed by the operational overhead of managing GPU infrastructure at scale without adequate automation.