About the Book
Build InfiniBand fabrics that stay fast, stable, and observable for AI and HPC workloads.
Running InfiniBand at production scale is hard. Physical plant, routing policy, partitions, congestion control, and day-two operations all interact. Teams need a practical reference that connects design choices to the counters, commands, and workflows used in real clusters.
Practical InfiniBand Administration provides that reference. It moves from RDMA fundamentals to fabric design and bring-up, then into monitoring, troubleshooting, and workload integration with UCX, NCCL, SHARP, Kubernetes, and OpenShift, so you can keep throughput high and variance low.
A clear RDMA object model (PD, QP, CQ, MR, and address handles) with a clean QP lifecycle using RDMA CM
Throughput math operators actually use: SDR through HDR and NDR data rates, lane counts, MTU, and payload efficiency that matches real links
Cabling and optics choices: QSFP56, QSFP112, and twin-port OSFP form factors; AOC, DAC, and ACC cables; MPO polarity and bend radius
OpenSM configuration that sticks: routing engines (minhop, ftree, updn, torus-2QoS) and predictable sweeps
Partitions and QoS in practice: P_Keys, SL-to-VL mapping, and arbitration that protects storage and control traffic
NVIDIA congestion control deployed safely: policy files, keys, staged rollout, and validation on live clusters
Adaptive routing and SHIELD: enablement steps with acceptance tests that prove path dispersion
IB routers for multi-subnet scale-out: practical design patterns and known limits
Fabric design and bring-up: spine-leaf and fat-tree planning, oversubscription math, and baseline and burn-in gates
Day-0 to day-2 operations: ibdiagnet, ibnetdiscover, perfquery, and ibtracert used as an actionable toolkit
Firmware and driver lifecycle: repeatable workflows with mlxfwmanager and mstflint
Troubleshooting playbooks: physical faults via SymbolErrors and LinkDowned, XmitWait analysis, path checks, and sweep timing
AI and HPC integration: UCX and HPC-X transport selection, parameters, and a measurement strategy that holds up
NCCL over InfiniBand: essential environment variables, topology awareness, and stable channel plans
SHARP collectives offload: setup, orchestration, and verification on both MPI and NCCL stacks
Platform features: GPUDirect RDMA with dma-buf checks, IPoIB connected-mode tuning, and practical sysctls
Virtualization with SR-IOV on ConnectX: host and hypervisor configuration, and guest-level validation
Kubernetes and OpenShift patterns: the NVIDIA Network Operator and DOCA-OFED with a NicClusterPolicy that matches the fabric
The RDMA shared device plugin and CDI, device exposure inside pods, and the SR-IOV InfiniBand CNI with Multus attachments
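The throughput math the book covers can be previewed with a back-of-the-envelope calculation. The sketch below assumes an HDR 4x link (four lanes at roughly 50 Gb/s each after encoding) and a 4096-byte MTU with typical on-subnet RC overhead of LRH 8 B + BTH 12 B + ICRC 4 B + VCRC 2 B, no GRH; treat the exact figures as illustrative rather than a benchmark.

```shell
#!/usr/bin/env bash
# Illustrative best-case goodput for an HDR 4x link at MTU 4096.
lanes=4
lane_gbps=50                  # HDR: ~50 Gb/s per lane of usable data rate
mtu=4096                      # payload bytes per packet
overhead=$((8 + 12 + 4 + 2))  # LRH + BTH + ICRC + VCRC (no GRH on-subnet)

# Payload efficiency: payload / (payload + per-packet header overhead)
eff=$(awk -v m="$mtu" -v o="$overhead" 'BEGIN { printf "%.4f", m / (m + o) }')
# Best-case goodput: efficiency times the raw link data rate
goodput=$(awk -v e="$eff" -v l="$lanes" -v g="$lane_gbps" \
  'BEGIN { printf "%.1f", e * l * g }')

echo "payload efficiency: $eff"            # ~0.9937
echo "best-case goodput:  ${goodput} Gb/s" # ~198.7 Gb/s
```

Real links lose a little more to acknowledgements, congestion, and credit stalls, which is exactly the gap the monitoring chapters teach you to measure.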
This is a code-heavy guide with working Bash, C, YAML, and JSON snippets that map directly to bring-up, validation, and on-call workflows.
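As a taste of that style, here is a minimal Bash sketch of the counter-triage pattern the troubleshooting playbooks build on. The `flag_ports` helper, the threshold, and the sample input are all illustrative; the fields mirror real port counters such as SymbolErrorCounter, but the format is a simplified stand-in, not verbatim perfquery output.

```shell
#!/usr/bin/env bash
# Flag ports whose error counters exceed a threshold.
# Input format is a simplified stand-in for perfquery-style output;
# adapt the parsing to whatever your diagnostics tool actually emits.
flag_ports() {
  awk -v max="$1" '
    /^PortSelect:/                        { port = $2 }
    /^SymbolErrorCounter:/ && $2+0 > max  { printf "port %s: SymbolErrorCounter=%s\n", port, $2 }
    /^LinkDownedCounter:/  && $2+0 > max  { printf "port %s: LinkDownedCounter=%s\n", port, $2 }
  '
}

# Hypothetical counter dump for two switch ports; port 2 is unhealthy.
flagged=$(flag_ports 10 <<'EOF'
PortSelect: 1
SymbolErrorCounter: 0
LinkDownedCounter: 0
PortSelect: 2
SymbolErrorCounter: 847
LinkDownedCounter: 3
EOF
)
echo "$flagged"
```

With a threshold of 10, only port 2's SymbolErrorCounter is flagged; LinkDownedCounter stays under the limit. The same pattern scales to sweeping a whole fabric and diffing against a baseline.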
Add a reliable InfiniBand playbook to your toolkit: get your copy today.